Systems & Postmortems

Public network-layer postmortems — Cloudflare 2019 regex outage, Facebook 2021 BGP withdrawal, AWS us-east-1 S3 2017.


Every senior engineer should be able to walk through two or three public network postmortems. They teach what fails in production at scale and what the recovery patterns look like.

The three included here cover three different layers: a software bug in Cloudflare's WAF (application-level deploy), Facebook's BGP withdrawal (network layer + out-of-band access), and AWS S3's us-east-1 outage (cascading failure inside a single region). Read all three of the public postmortems; the writeups in this section are summaries with lessons.

Key concepts

Routing protocols carry control-plane state: corrupt or withdraw it and the data plane has nowhere to send traffic
Out-of-band access matters when your normal access requires the network you're fixing
Cascading failure is the dominant production failure mode — design for blast radius
Deploy safety (canary, kill switches, rollback) is non-negotiable for global services
Public postmortems are the gold standard of operational learning
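
The deploy-safety concept above can be sketched as a staged rollout that widens exposure one step at a time and stops at the first bad signal. This is a minimal illustration, not any vendor's actual deploy system: `staged_rollout`, `is_healthy`, and `kill_switch_on` are hypothetical names, and the stage fractions are made up for the example.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    is_healthy: Callable[[float], bool],
    kill_switch_on: Callable[[], bool],
) -> Tuple[str, List[float]]:
    """Widen a change stage by stage; stop on the first bad signal.

    stages:         fractions of the fleet to expose, smallest first
    is_healthy:     hypothetical health probe, called after each stage
    kill_switch_on: hypothetical operator override, checked before each stage
    Returns (outcome, stages that were applied before stopping).
    """
    applied: List[float] = []
    for fraction in stages:
        # The kill switch is checked first so an operator can halt the
        # rollout without waiting for a health probe to fail.
        if kill_switch_on():
            return "rolled_back:kill_switch", applied
        applied.append(fraction)
        if not is_healthy(fraction):
            return "rolled_back:unhealthy", applied
    return "completed", applied


# A change that only misbehaves at 10% of the fleet never reaches 100%.
outcome, applied = staged_rollout(
    stages=[0.01, 0.10, 1.00],
    is_healthy=lambda fraction: fraction < 0.10,
    kill_switch_on=lambda: False,
)
print(outcome, applied)
```

The point of the structure is that a global push is the last stage, not the only one: the faulty change in the usage example is stopped with 10% exposure instead of 100%.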

Reference template

// Reading a network postmortem
1. What was the trigger?      (deploy? config change? hardware?)
2. What broke first?          (which layer? which contract?)
3. What cascaded?             (and why)
4. What did recovery look like? (rollback? cold start? manual intervention?)
5. What changed afterwards?   (technical + process)

Adapt to your problem; the structure is the load-bearing part.

Common pitfalls

Treating 'BGP withdrew our routes' as exotic — it's recurring and easy to do by accident
Underestimating regex backtracking — Cloudflare's WAF wasn't unique
Trusting backups and runbooks that haven't been tested in a year
Optimising for the happy path and forgetting the recovery path
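
To make the regex-backtracking pitfall concrete, here is a simplified model of how a naive backtracking matcher handles two adjacent greedy wildcards followed by a literal, as in the illustrative pattern `.*.*=`. It is not a real regex engine and not Cloudflare's code; `count_attempts` is a hypothetical helper that only counts how many wildcard splits get tried before the match succeeds or fails.

```python
from typing import Tuple


def count_attempts(text: str) -> Tuple[int, bool]:
    """Model a naive backtracking match of `.*.*=` anchored at position 0.

    The first `.*` greedily takes text[:i], the second takes text[i:j],
    and then the matcher checks whether text[j] is '='. Every failed
    check forces a backtrack to a shorter split.
    Returns (number of splits tried, whether a match was found).
    """
    n = len(text)
    attempts = 0
    for i in range(n, -1, -1):          # first wildcard gives back characters
        for j in range(n, i - 1, -1):   # second wildcard gives back characters
            attempts += 1
            if j < n and text[j] == "=":
                return attempts, True
    return attempts, False


# Input with no '=' forces the matcher to try every split before failing.
for length in (10, 20, 40):
    print(length, count_attempts("x" * length))
```

For input of length n with no `=`, this model tries (n + 1)(n + 2) / 2 splits, so doubling the input roughly quadruples the work. Patterns with nested quantifiers can grow much faster than that, which is why engines with linear-time guarantees are often preferred for untrusted input.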

Items (3)

Cloudflare 2019 — The Regex Outage
A WAF rule whose regex backtracked catastrophically and exhausted CPU across the edge; about 27 minutes of global downtime; the postmortem on safe deploys.

Postmortem Foundational
Facebook 2021 — The BGP Withdrawal
A maintenance command took down Facebook's backbone, and its DNS servers then withdrew their BGP routes; 6+ hours of global outage; the lessons on out-of-band access.

Postmortem Foundational
AWS us-east-1 2017 — The S3 Outage
A mistyped input to a command, run while debugging the S3 billing system, removed more servers than intended; cascading failure through us-east-1; the lessons on blast radius.

Postmortem Foundational
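
The out-of-band and blast-radius lessons from these items can both be framed as questions about a dependency graph: what stops working if one component fails, and which recovery tools still work when it does. The sketch below assumes a hand-written, hypothetical dependency map; the component names are invented for illustration and do not describe any of the three companies' real architectures.

```python
from typing import Dict, List, Set

# Hypothetical map: each key depends directly on the listed components.
DEPENDS_ON: Dict[str, List[str]] = {
    "web": ["dns", "storage"],
    "dashboard": ["web"],
    "dns": ["backbone"],
    "storage": ["backbone"],
    "remote_admin_tools": ["dns"],
    "console_server": [],  # out-of-band: reachable without the backbone
    "backbone": [],
}


def transitive_deps(component: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Everything `component` needs, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(component, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def blast_radius(failed: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Every component that stops working if `failed` goes down."""
    return {name for name in graph if failed in transitive_deps(name, graph)}


def usable_recovery_paths(
    failed: str, tools: List[str], graph: Dict[str, List[str]]
) -> List[str]:
    """Recovery tools that do not depend on the thing being recovered."""
    return [tool for tool in tools if failed not in transitive_deps(tool, graph)]


print(sorted(blast_radius("backbone", DEPENDS_ON)))
print(usable_recovery_paths(
    "backbone", ["remote_admin_tools", "console_server"], DEPENDS_ON
))
```

In this toy graph a backbone failure takes out everything except the console server, which is also the only recovery path left. Running the same check against a real dependency inventory is one way to find recovery tools that silently depend on the system they are meant to repair.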
