
Postmortems

Public incident postmortems from real outages at Facebook, AWS, Cloudflare, and others. The teacher you actually learn from is the system that broke last week.


Every postmortem here is a public, well-documented incident from a hyperscaler. The lesson isn't "don't make this mistake" (you wouldn't have); it's the *shape* of the failure: how a small change cascaded, what monitoring missed, and what the recovery looked like.

Read these as an antidote to designing without imagining failure. The strongest system-design interview answers name a failure mode before the interviewer asks about it.

Key concepts

  • Blast radius is set by the smallest unit of independent failure — measure it before you ship
  • Control planes that depend on the data plane (or vice versa) create unrecoverable deadlocks during outages
  • Cache-warm dependencies make 'just restart it' a multi-hour operation at scale
  • Capacity for failure scenarios is rarely the same as capacity for steady-state load
  • The recovery playbook needs to be runnable when the auth system is down — paper backups exist for a reason
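The control-plane and data-plane point above can be checked mechanically: if the tool you would use to recover sits inside a dependency loop, an outage anywhere in that loop can lock you out. The following is a minimal sketch, not any particular company's tooling. The service names and the `DEPENDS_ON` map are hypothetical, chosen only to illustrate the pattern.

```python
from typing import Dict, List, Optional

# Hypothetical dependency map: each service lists what it needs in order to run.
DEPENDS_ON: Dict[str, List[str]] = {
    "deploy-tool": ["auth"],
    "auth": ["config-store"],
    "config-store": ["internal-dns"],
    "internal-dns": ["deploy-tool"],  # the recovery tool sits inside the loop
    "metrics": ["internal-dns"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: Dict[str, int] = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we have closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    cycle = find_cycle(DEPENDS_ON)
    print(" -> ".join(cycle) if cycle else "no cycle found")
```

When reading a postmortem, it can help to sketch the real dependency map this way and ask which edge you would cut (for example, an out-of-band access path) so that the recovery tooling no longer depends on the thing it is meant to repair.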

Reference template

// Postmortem reading template
## Summary
## Timeline
## Root cause
## Cascading effects
## What was fixed
## Lessons that generalize
## Related material

Adapt to your problem; the structure is the load-bearing part.

Common pitfalls

  • Reading postmortems as horror stories rather than design exercises — the goal is to extract a transferable lesson
  • Underestimating the recovery path — 'we restored from backup' is rarely a complete answer at scale
  • Believing that big-tech engineers won't make small mistakes — they make exactly the same ones we do, with more zeros attached
  • Skipping the timeline — the *order* of cascading events is what teaches the lesson
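One way to avoid skipping the timeline is to transcribe it into ordered events and measure the gaps yourself. The sketch below assumes a simple, hypothetical event format (ISO timestamp, kind, note). The timestamps and labels are invented for illustration and do not come from any real incident.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Event = Tuple[str, str, str]  # (ISO timestamp, kind, note)

# Hypothetical events, deliberately out of order, as reading notes often are.
EVENTS: List[Event] = [
    ("2024-01-01T10:42:00", "mitigated", "Rollback completes"),
    ("2024-01-01T10:00:00", "trigger", "Config change deployed"),
    ("2024-01-01T10:19:00", "detected", "First alert pages on-call"),
    ("2024-01-01T10:04:00", "impact", "Error rate starts climbing"),
]


def summarize(events: List[Event]) -> Tuple[List[Event], Dict[str, float]]:
    """Sort events chronologically and compute detection and mitigation gaps."""
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e[0]))

    # Record the first time each kind of event occurs.
    first: Dict[str, datetime] = {}
    for ts, kind, _note in ordered:
        first.setdefault(kind, datetime.fromisoformat(ts))

    def minutes(start: str, end: str) -> float:
        return (first[end] - first[start]).total_seconds() / 60

    gaps = {
        "time_to_detect_min": minutes("impact", "detected"),
        "time_to_mitigate_min": minutes("impact", "mitigated"),
    }
    return ordered, gaps


if __name__ == "__main__":
    ordered, gaps = summarize(EVENTS)
    for ts, kind, note in ordered:
        print(f"{ts}  {kind:<10} {note}")
    print(gaps)
```

The gap between impact and detection is usually where the monitoring lesson lives, and the gap between detection and mitigation is where the recovery-path lesson lives.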
