Postmortems
Public incident postmortems from real outages at Facebook, AWS, Cloudflare, and others. The teacher you actually learn from is the system that broke last week.
Every postmortem here covers a public, well-documented incident at a hyperscaler. The lesson isn't "don't make this mistake" (you will rarely be in a position to make the same one); it's the *shape* of the failure: how a small change cascaded, what the monitoring missed, and what the recovery looked like.
Read these as an antidote to designing without imagining failure. The strongest system-design interview answers proactively name a failure mode the interviewer was about to ask about.
Key concepts
- Blast radius is set by the smallest unit of independent failure — measure it before you ship
- Control planes that depend on the data plane (or vice versa) can deadlock recovery during an outage: each side needs the other to come back first
- Dependencies on warm caches make 'just restart it' a multi-hour operation at scale
- Capacity for failure scenarios is rarely the same as capacity for steady-state load
- The recovery playbook needs to be runnable when the auth system is down — paper backups exist for a reason
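Two of the concepts above can be checked mechanically before an incident. The following is a minimal sketch, not a definitive tool: it assumes you can write down what each component needs in order to work, and it assumes load spreads evenly across surviving units. The service names and numbers are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional


def find_dependency_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none.

    `deps` maps each component to the components it needs in order to work.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    colour = {name: WHITE for name in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so the loop is closed
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for name in list(deps):
        if colour[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


def utilization_after_loss(units: int, steady_utilization: float, lost: int = 1) -> float:
    """Utilization of each surviving unit if load is spread evenly after `lost` units fail."""
    survivors = units - lost
    if survivors <= 0:
        raise ValueError("no surviving units")
    return steady_utilization * units / survivors


# Hypothetical recovery dependencies: what each thing needs in order to work.
recovery_deps = {
    "deploy_tool": ["auth"],
    "auth": ["internal_dns"],
    "internal_dns": ["backbone"],
    "backbone": ["deploy_tool"],  # fixing the backbone needs the deploy tool
}
cycle = find_dependency_cycle(recovery_deps)
print(" -> ".join(cycle) if cycle else "no cycle")

# Three zones at 60% each: losing one pushes the other two to about 90%.
print(round(utilization_after_loss(3, 0.60), 2))
# Four zones at 80% each: losing one needs more than 100% of what is left.
print(round(utilization_after_loss(4, 0.80), 2))
```

A cycle in the first check means the recovery path depends on the thing being recovered. A result above 1.0 in the second means steady-state capacity cannot absorb that failure.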
Reference template
// Postmortem reading template
## Summary
## Timeline
## Root cause
## Cascading effects
## What was fixed
## Lessons that generalize
## Related material
Adapt to your problem; the structure is the load-bearing part.
Common pitfalls
- Reading postmortems as horror stories rather than design exercises — the goal is to extract a transferable lesson
- Underestimating the recovery path — 'we restored from backup' is rarely a complete answer at scale
- Believing that big-tech engineers won't make small mistakes — they make exactly the same ones we do, with more zeros attached
- Skipping the timeline — the *order* of cascading events is what teaches the lesson
Items (4)
- Facebook / WhatsApp / Instagram: 2021 BGP outage
A command issued during routine backbone maintenance disconnected Facebook's data centers, and its DNS servers then withdrew their BGP routes, taking Facebook off the internet. Roughly six hours of global outage; internal tools and badge readers failed too, reportedly locking engineers out of the buildings where the fix had to be applied.
Postmortem Foundational
- AWS Kinesis: 2020 us-east-1 outage
Adding capacity pushed the Kinesis front-end servers past their OS thread limit, and the failure cascaded across services that depended on Kinesis for control-plane operations.
Postmortem Intermediate
- AWS us-east-1: repeated cascade failures
Why one region keeps taking down half the internet, and what 'control plane in one region' really costs.
Postmortem Intermediate
- Cloudflare: 2019 regex catastrophic backtracking
A single WAF rule whose regex backtracked catastrophically drove CPU to 100% on every edge node at once.
Postmortem Foundational
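The Kinesis item rests on a simple piece of arithmetic. AWS's public summary describes each front-end server keeping an OS thread for every other server in the fleet, so the per-server thread count grows with fleet size. The sketch below shows why crossing the limit hits the whole fleet at once instead of one machine at a time. The limit, baseline, and fleet sizes are invented for illustration and are not AWS's real values.

```python
def threads_per_server(fleet_size: int, baseline_threads: int) -> int:
    """Threads on one server that keeps one thread per peer plus a fixed baseline."""
    return baseline_threads + (fleet_size - 1)


def servers_over_limit(fleet_size: int, baseline_threads: int, thread_limit: int) -> int:
    """Number of servers above the thread limit.

    Every server carries the same thread count, so the result is either
    zero or the entire fleet.
    """
    over = threads_per_server(fleet_size, baseline_threads) > thread_limit
    return fleet_size if over else 0


THREAD_LIMIT = 10_000  # hypothetical per-process OS thread limit
BASELINE = 500         # hypothetical threads used for everything else

# 500 + 9,499 peers = 9,999 threads: just under the limit, nothing fails.
before = servers_over_limit(9_500, BASELINE, THREAD_LIMIT)

# Add 100 servers: 500 + 9,599 peers = 10,099 threads on every server.
after = servers_over_limit(9_600, BASELINE, THREAD_LIMIT)

print(before, after)
```

Under these assumptions a modest capacity addition moves the fleet from zero unhealthy servers to all of them, which is the blast-radius pattern worth looking for when reading the timeline.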