Postmortems

Public incident postmortems from real outages at Facebook, AWS, Cloudflare, and others. The teacher you actually learn from is the system that broke last week.

4 items 2 Foundational 2 Intermediate

Every postmortem here is a public, well-documented incident from a hyperscaler. The lesson isn't "don't make this mistake" (you wouldn't have) — it's the *shape* of the failure: how a small change cascaded, what monitoring missed it, what the recovery looked like.

Read these as the antidote to designing in the absence of failure imagination. The best system-design answers in interviews proactively name a failure mode the interviewer was about to ask about.

Key concepts

Blast radius is set by the smallest unit of independent failure — measure it before you ship
Control planes that depend on the data plane (or vice versa) create unrecoverable deadlocks during outages
Cache-warm dependencies make 'just restart it' a multi-hour operation at scale
Capacity for failure scenarios is rarely the same as capacity for steady-state load
The recovery playbook needs to be runnable when the auth system is down — paper backups exist for a reason

Reference template

// Postmortem reading template
## Summary
## Timeline
## Root cause
## Cascading effects
## What was fixed
## Lessons that generalize
## Related material

Adapt to your problem; the structure is the load-bearing part.

Common pitfalls

Reading postmortems as horror stories rather than design exercises — the goal is to extract a transferable lesson
Underestimating the recovery path — 'we restored from backup' is rarely a complete answer at scale
Believing that big-tech engineers won't make small mistakes — they make exactly the same ones we do, with more zeros attached
Skipping the timeline — the *order* of cascading events is what teaches the lesson

Items (4)

Facebook / WhatsApp / Instagram — 2021 BGP outage
A routine BGP maintenance command withdrew Facebook from the internet. Six hours of global blackout; internal tools and badge readers locked out of the buildings that hosted the fix.

Postmortem Foundational
AWS Kinesis — 2020 us-east-1 outage
Thread-limit exhaustion cascaded across services that depended on Kinesis for control-plane operations.

Postmortem Intermediate
AWS us-east-1 — repeated cascade failures
Why one region keeps taking down half the internet, and what 'control plane in one region' really costs.

Postmortem Intermediate
Cloudflare — 2019 regex catastrophic backtracking
A single WAF rule with exponential regex backtracking burned 100% CPU across every edge node simultaneously.

Postmortem Foundational

Postmortems

Key concepts

Reference template

Common pitfalls

Related topics

Items (4)

Keyboard shortcuts