Systems & Postmortems
Public network-layer postmortems — Cloudflare 2019 regex outage, Facebook 2021 BGP withdrawal, AWS us-east-1 S3 2017.
Every senior engineer should be able to walk through two or three public network postmortems. They teach what fails in production at scale and what the recovery patterns look like.
The three included here cover three different layers: a software bug in Cloudflare's WAF (application-level deploy), Facebook's BGP withdrawal (network layer + out-of-band access), and AWS S3's us-east-1 outage (cascading failure inside a single region). Read all three of the public postmortems; the writeups in this section are summaries with lessons.
Key concepts
- Routing protocols maintain control-plane state: corrupt or withdraw it and the data plane goes with it
- Out-of-band access matters when your normal access requires the network you're fixing
- Cascading failure is the dominant production failure mode — design for blast radius
- Deploy safety (canary, kill switches, rollback) is non-negotiable for global services
- Public postmortems are the gold standard of operational learning
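The deploy-safety bullet above can be made concrete with a small sketch. It assumes a rollout proceeds through ordered stages, with hypothetical callbacks (`deploy`, `healthy`, `rollback`, `kill_switch`) standing in for whatever your deployment tooling actually provides; it is an illustration of the pattern, not any vendor's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    deployed: List[str]  # stages that received the change, in order
    aborted: bool        # True if the rollout stopped early and was rolled back


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    kill_switch: Callable[[], bool] = lambda: False,
) -> RolloutResult:
    """Deploy one stage at a time; undo everything on the first bad signal."""
    deployed: List[str] = []

    def abort() -> RolloutResult:
        # Roll back in reverse order so the widest stage is undone first.
        for stage in reversed(deployed):
            rollback(stage)
        return RolloutResult(deployed, True)

    for stage in stages:
        # The kill switch is checked before touching each new stage.
        if kill_switch():
            return abort()
        deploy(stage)
        deployed.append(stage)
        # A failed health check limits the blast radius to stages so far.
        if not healthy(stage):
            return abort()
    return RolloutResult(deployed, False)
```

Called with stages such as `["canary", "one-region", "global"]`, a failed health check at `"one-region"` means `"global"` never receives the change, which is the point of staging a global deploy.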
Reference template
// Reading a network postmortem
1. What was the trigger? (deploy? config change? hardware?)
2. What broke first? (which layer? which contract?)
3. What cascaded? (and why)
4. What did recovery look like? (rollback? cold start? manual intervention?)
5. What changed afterwards? (technical + process)
Adapt to your problem; the structure is the load-bearing part.
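One way to keep yourself honest while reading is to record the five answers in a fixed structure and check which ones are still blank. The sketch below assumes a hypothetical `PostmortemNotes` record whose field names simply mirror the five questions; the example values come from the Facebook 2021 summary in this section.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class PostmortemNotes:
    trigger: str = ""        # 1. What was the trigger?
    first_failure: str = ""  # 2. What broke first?
    cascade: str = ""        # 3. What cascaded, and why?
    recovery: str = ""       # 4. What did recovery look like?
    changes: str = ""        # 5. What changed afterwards?

    def unanswered(self) -> List[str]:
        """Return the names of questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


notes = PostmortemNotes(
    trigger="maintenance command",
    first_failure="BGP routes for the nameservers were withdrawn",
)
remaining = notes.unanswered()  # the questions still to fill in from the postmortem
```

Here `remaining` lists `cascade`, `recovery`, and `changes`, which are the parts of the public writeup still to be read and summarised.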
Common pitfalls
- Treating 'BGP withdrew our routes' as exotic — it's recurring and easy to do by accident
- Underestimating regex backtracking — Cloudflare's WAF wasn't unique
- Trusting backups and runbooks that haven't been tested in a year
- Optimising for the happy path and forgetting the recovery path
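To see why regex backtracking is easy to underestimate, the sketch below counts the work done by a toy backtracking matcher. It is an assumption-laden illustration: `count_steps` is a hypothetical helper that supports only literal characters and a greedy `.*` token, it is not how PCRE or any production engine is implemented, and the pattern is an illustrative shape with adjacent wildcards rather than the actual WAF rule.

```python
from typing import List, Tuple


def count_steps(tokens: List[str], text: str) -> Tuple[bool, int]:
    """Match tokens against the whole text; return (matched, steps taken).

    Each token is either ".*" (greedy wildcard) or a single literal character.
    """
    steps = 0

    def match(pi: int, ti: int) -> bool:
        nonlocal steps
        steps += 1
        if pi == len(tokens):
            return ti == len(text)
        tok = tokens[pi]
        if tok == ".*":
            # Greedy: take as much as possible, then back off one char at a time.
            for end in range(len(text), ti - 1, -1):
                if match(pi + 1, end):
                    return True
            return False
        return ti < len(text) and text[ti] == tok and match(pi + 1, ti + 1)

    return match(0, 0), steps


# Two adjacent wildcards, and a trailing literal that never appears in the input,
# so the matcher must exhaust every way of splitting the text before failing.
pattern = [".*", ".*", "=", ".*", ";"]
_, steps_small = count_steps(pattern, "x=" + "x" * 20)
_, steps_large = count_steps(pattern, "x=" + "x" * 40)
```

Doubling the run of trailing characters more than doubles the step count in this toy model. Real engines differ in detail, but the failure shape is the same: a non-matching input forces the matcher to try every split between the wildcards.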
Items (3)
- Cloudflare 2019 — The Regex Outage
A WAF rule containing a regex with catastrophic backtracking exhausted CPU across the edge; about 30 minutes of global edge downtime; the postmortem on safe deploys.
Postmortem Foundational
- Facebook 2021 — The BGP Withdrawal
A maintenance command took down Facebook's backbone connections, after which its nameservers withdrew their BGP route advertisements; 6+ hours of global outage; the lessons on out-of-band access.
Postmortem Foundational
- AWS us-east-1 2017 — The S3 Outage
A mistyped command, run while debugging the S3 billing system, removed far more servers than intended; cascading failure through us-east-1; the lessons on blast radius.
Postmortem Foundational