Systems & Postmortems

Public network postmortems: the Cloudflare 2019 regex outage, the Facebook 2021 BGP withdrawal, and the AWS S3 us-east-1 outage of 2017.

Every senior engineer should be able to walk through two or three public network postmortems. They teach what fails in production at scale and what the recovery patterns look like.

The three included here fail at three different layers: a software bug in Cloudflare's WAF (application-level deploy), Facebook's BGP withdrawal (network layer + out-of-band access), and AWS S3's us-east-1 outage (cascading failure inside a single region). Read the three original postmortems in full; the writeups in this section only summarize them and draw out the lessons.

Key concepts

  • Routing protocols such as BGP maintain control-plane state; corrupt or withdraw that state and the data plane goes with it
  • Out-of-band access matters when your normal access requires the network you're fixing
  • Cascading failure is the dominant production failure mode — design for blast radius
  • Deploy safety (canary, kill switches, rollback) is non-negotiable for global services
  • Public postmortems are the gold standard of operational learning
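
The deploy-safety point can be made concrete with a small sketch. The Python below is illustrative only: `staged_rollout`, `error_rate_at`, `kill_switch_on`, and the 1% error threshold are hypothetical names and values, not taken from any of the three postmortems. It shows the general shape of a staged (canary) rollout that checks a kill switch before each stage and rolls back on the first unhealthy signal, so a bad change reaches only a small fraction of the fleet.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool          # True if every stage passed its health check
    reached_fraction: float  # largest fraction of the fleet that ran the change
    rolled_back: bool        # True if the change had to be reverted


def staged_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    kill_switch_on: Callable[[], bool],
    max_error_rate: float = 0.01,  # assumed threshold, tune per service
) -> RolloutResult:
    """Widen a change stage by stage; stop and roll back on the first bad signal.

    stages         -- increasing fleet fractions, e.g. (0.01, 0.10, 0.50, 1.0)
    error_rate_at  -- hypothetical probe: observed error rate at a given fraction
    kill_switch_on -- hypothetical flag an operator can flip to abort at once
    """
    reached = 0.0
    for fraction in stages:
        # The kill switch is checked before widening, so an operator can
        # stop the rollout between stages.
        if kill_switch_on():
            return RolloutResult(False, reached, rolled_back=reached > 0.0)
        reached = fraction
        if error_rate_at(fraction) > max_error_rate:
            # Blast radius is capped at the current stage, not the whole fleet.
            return RolloutResult(False, reached, rolled_back=True)
    return RolloutResult(True, reached, rolled_back=False)


if __name__ == "__main__":
    # A change that fails everywhere it lands is caught at the 1% stage.
    result = staged_rollout((0.01, 0.10, 0.50, 1.0), lambda f: 0.30, lambda: False)
    print(result)
    # RolloutResult(completed=False, reached_fraction=0.01, rolled_back=True)
```

The design choice worth noticing is the ordering: the health check runs after each stage and before the next one widens, which is what bounds the blast radius. A rollout that pushes to every location at once has no stage at which to stop.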

Reference template

// Reading a network postmortem
1. What was the trigger?          (deploy? config change? hardware?)
2. What broke first?              (which layer? which contract?)
3. What cascaded?                 (and why)
4. What did recovery look like?   (rollback? cold start? manual intervention?)
5. What changed afterwards?       (technical + process)

Adapt to your problem; the structure is the load-bearing part.
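
One way to keep that structure honest while reading is to record an answer per question and see which ones are still blank. The sketch below is a minimal, assumed encoding of the five questions as a Python dataclass; `PostmortemNotes` and its field names are hypothetical and chosen only to mirror the template.

```python
from dataclasses import dataclass, fields


@dataclass
class PostmortemNotes:
    """One field per question in the reading template (names are illustrative)."""
    trigger: str = ""        # 1. deploy, config change, hardware?
    first_failure: str = ""  # 2. which layer or contract broke first?
    cascade: str = ""        # 3. what else failed, and why?
    recovery: str = ""       # 4. rollback, cold start, manual intervention?
    follow_ups: str = ""     # 5. technical and process changes afterwards

    def unanswered(self) -> list[str]:
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


if __name__ == "__main__":
    notes = PostmortemNotes(
        trigger="WAF rule deploy",
        first_failure="regex backtracking in the WAF",
    )
    print(notes.unanswered())  # ['cascade', 'recovery', 'follow_ups']
```

If `unanswered()` is not empty after reading a postmortem, either the writeup skipped that question or the reader did; both are worth knowing.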

Common pitfalls

  • Treating 'BGP withdrew our routes' as exotic — it's recurring and easy to do by accident
  • Underestimating regex backtracking: one pathological pattern can exhaust CPU, and Cloudflare's WAF is not the only system exposed to it
  • Trusting backups and runbooks that haven't been tested in a year
  • Optimising for the happy path and forgetting the recovery path
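
The regex pitfall is easier to respect once the growth is visible. The sketch below is a toy backtracking matcher written for illustration, not Cloudflare's engine or rule: `count_steps` is a hypothetical helper that supports only `.*` and single literal characters, and counts how many match attempts a greedy backtracking strategy makes. The pattern `.*.*=` is similar in shape to the fragment discussed in Cloudflare's writeup; the input deliberately contains no `=`, so every split between the two `.*` tokens is tried before the match fails.

```python
from typing import Sequence


def count_steps(pattern: Sequence[str], text: str) -> int:
    """Count match attempts made by a naive greedy backtracking matcher.

    pattern -- tokens, each either ".*" or a single literal character
    text    -- input to match from position 0 (prefix match)
    """
    steps = 0

    def match(pi: int, ti: int) -> bool:
        nonlocal steps
        steps += 1
        if pi == len(pattern):
            return True  # all tokens consumed
        tok = pattern[pi]
        if tok == ".*":
            # Greedy: take as much as possible first, then back off one
            # character at a time whenever the rest of the pattern fails.
            for end in range(len(text), ti - 1, -1):
                if match(pi + 1, end):
                    return True
            return False
        # Literal token: must match the current character exactly.
        return ti < len(text) and text[ti] == tok and match(pi + 1, ti + 1)

    match(0, 0)
    return steps


if __name__ == "__main__":
    pattern = [".*", ".*", "="]
    for n in (10, 20, 40):
        print(n, count_steps(pattern, "x" * n))
    # 10 78
    # 20 253
    # 40 903
```

Doubling the input more than triples the work here, which is quadratic growth from just two adjacent wildcards. Nested quantifiers can be far worse, which is why input length limits, timeouts, or a non-backtracking engine are common defenses.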
