Failures & Postmortems
Public API postmortems — Knight Capital 2012, AWS S3 2017, Facebook + Uber outages. What the contract didn't enforce.
Every senior API designer should be able to walk through two or three public API postmortems. They teach what fails in production and what the contract didn't say. The three included here cover three different root causes: a software deploy gone wrong (Knight Capital), a typo in a maintenance command (AWS S3), and cascading dependency failures (Facebook, Uber).
Read the public postmortems linked from each writeup. The writeups here are summaries with the API-design lesson surfaced — what the contract should have enforced, what monitoring should have caught, what guardrails would have prevented the cascade.
Key concepts
- Most public API failures trace to a deploy or a config change — the API itself rarely fails on its own
- The five patterns: deploy bug, capacity, contract drift, dependency outage, cascading retries
- Public postmortems are the gold standard of operational learning
- Out-of-band access matters when your normal access requires the API you're fixing
- Cascading failure is the dominant production failure mode; design for blast radius
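One way to limit the blast radius of cascading retries is a retry budget combined with jittered exponential backoff. The sketch below is illustrative only: `RetryBudget`, `call_with_retries`, and every parameter value are hypothetical names and defaults, not taken from any of the postmortems. Retries are treated as a spendable resource that is earned by ordinary traffic, so a failing dependency sees at most a small fraction of extra load.

```python
import random
import time


class RetryBudget:
    """Caps retries to a fraction of recent requests (hypothetical helper)."""

    def __init__(self, ratio=0.1, min_tokens=1.0, max_tokens=100.0):
        self.ratio = ratio            # retry tokens earned per request
        self.tokens = min_tokens      # small starting allowance
        self.max_tokens = max_tokens  # upper bound on saved-up retries

    def record_request(self):
        """Every first attempt earns a fraction of a retry token."""
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        """Spend one token for a retry; refuse when the budget is empty."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_retries(fn, budget, max_attempts=3, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Call fn, retrying ConnectionError only while the budget allows it."""
    budget.record_request()
    attempt = 0
    while True:
        try:
            return fn()
        except ConnectionError:
            attempt += 1
            if attempt >= max_attempts or not budget.try_spend():
                raise  # give up instead of piling load on a sick dependency
            # Capped exponential backoff, scaled by a random factor in [0, 1)
            # so that clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1)) * rng()
            sleep(delay)
```

With a ratio of 0.1, a dependency that is fully down receives roughly 10% extra traffic from retries rather than two extra calls per request. The `sleep` and `rng` parameters exist only so the behaviour can be exercised without real delays.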
Reference template
// Reading an API postmortem
1. What was the trigger? (deploy? config? capacity? dependency?)
2. What broke first? (which contract failed first?)
3. What cascaded? (and why)
4. What did recovery look like? (rollback? cold start? manual intervention?)
5. What changed afterwards? (the API contract? the deploy pipeline? the monitoring?)
Adapt to your problem; the structure is the load-bearing part.
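If you keep reading notes in a structured form, the five questions map naturally onto a small record type. This is a minimal sketch; the class name `PostmortemNotes` and its field names are assumptions made for illustration, and the example values only restate the summary of the S3 incident given on this page.

```python
from dataclasses import dataclass, fields


@dataclass
class PostmortemNotes:
    """One field per question in the reading template (hypothetical names)."""
    trigger: str = ""             # 1. deploy? config? capacity? dependency?
    first_failure: str = ""       # 2. which contract failed first?
    cascade: str = ""             # 3. what cascaded, and why
    recovery: str = ""            # 4. rollback? cold start? manual intervention?
    changes_afterwards: str = ""  # 5. contract? deploy pipeline? monitoring?

    def unanswered(self):
        """Return the questions that still have no answer, in template order."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]


notes = PostmortemNotes(
    trigger="typo in a maintenance command",
    cascade="S3 control plane, then dependent services",
)
print(notes.unanswered())
# prints ['first_failure', 'recovery', 'changes_afterwards']
```

The `unanswered` check is the useful part: a writeup that leaves questions 4 or 5 blank has usually skipped the recovery path.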
Common pitfalls
- Treating 'a feature flag revived dead code' as exotic — it's the canonical deploy bug
- Underestimating the cost of a typo in a privileged command: the AWS S3 2017 outage started with one
- Assuming dependent services will degrade gracefully — they won't, by default
- Optimising for the happy path and forgetting the recovery path
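The typo pitfall is the easiest of these to guard against in code: validate a privileged request before any of it executes. The sketch below assumes a hypothetical capacity-removal tool; `plan_removal`, `UnsafeRemoval`, and the 5% default are invented for illustration and are not AWS's implementation. AWS's published summary of the S3 event does describe adding safeguards so that capacity cannot be removed below a subsystem's minimum required level, which is the idea shown here.

```python
class UnsafeRemoval(Exception):
    """Raised when a removal request fails a safety check."""


def plan_removal(requested, fleet, min_capacity, max_fraction=0.05):
    """Validate a capacity-removal request before anything is executed.

    requested     hosts the operator asked to remove
    fleet         all hosts currently serving
    min_capacity  fewest hosts the subsystem needs to stay healthy
    max_fraction  largest share of the fleet one command may remove
    """
    fleet = set(fleet)
    requested = set(requested)

    unknown = requested - fleet
    if unknown:
        raise UnsafeRemoval(f"unknown hosts: {sorted(unknown)}")

    if len(requested) > max_fraction * len(fleet):
        raise UnsafeRemoval(
            f"refusing to remove {len(requested)} of {len(fleet)} hosts "
            f"in one command (limit {max_fraction:.0%})"
        )

    remaining = len(fleet) - len(requested)
    if remaining < min_capacity:
        raise UnsafeRemoval(
            f"{remaining} hosts would remain, below the minimum of {min_capacity}"
        )

    return sorted(requested)  # the validated plan, safe to hand to an executor
```

Separating planning from execution means a mistyped argument produces an error message rather than an outage. The two limits cover different mistakes: the fraction check catches a request that is far too large, and the minimum-capacity check catches a small request made against a fleet that is already thin.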
Related topics
Items (4)
- What Causes API Failures: A Taxonomy
Deploy bugs, capacity, contract drift, dependency outage, cascading retries. The five patterns that recur across public postmortems.
Concept Foundational
- Knight Capital 2012: The 45-Minute $440M Bug
A repurposed flag revived dead code (Power Peg) on the one of eight SMARS servers that had missed the deploy. A trading-API contract that didn't enforce safety.
Postmortem Foundational
- AWS S3 2017: The us-east-1 Service Disruption
A typo in a maintenance command cascaded through the S3 control plane. S3 API requests in the region failed, and the large part of the Internet that depended on them went down too.
Postmortem Foundational
- Facebook and Uber API Outages: Patterns
Two short case studies of cascading API failures from major social and mobility APIs. What the contract didn't say.
Postmortem Foundational