What Causes API Failures — A Taxonomy
Deploy bugs, capacity overruns, contract drift, dependency outages, cascading retries: the five patterns that recur across public postmortems.
Summary
Read enough public postmortems and the patterns repeat. The names of the companies change. The technologies change. The blast radius changes by orders of magnitude. But the shape of the failure clusters into a small number of recurring patterns. Five of them cover most of the catalogue:
- Deploy bug — a change rolled out that should not have. The classic single-event trigger.
- Capacity overrun — demand exceeded what the system was provisioned for. The classic load event.
- Contract drift — clients and servers diverged. The classic versioning failure.
- Dependency outage — a downstream service you trusted went down. The classic blast-radius event.
- Cascading retries — a partial failure was amplified by aggressive retry policies. The classic self-inflicted wound.
These are not mutually exclusive. The most damaging outages of the last decade typically involve two or three of these stacked — a deploy bug that triggers a capacity overrun in a downstream service that triggers a cascading retry storm. But the patterns are individually recognisable, and naming them is the first step in designing against them.
The patterns matter to API designers because each one suggests a specific control that lives in the API contract: deploy safety needs kill-switches and atomic rollout; capacity needs rate limits and quotas; contract drift needs versioning discipline; dependency outage needs circuit breakers and degraded modes; cascading retries need exponential backoff with jitter and retry budgets. A good API contract has each control intentionally, not by accident.
Why it matters
Three reasons the taxonomy is worth memorising:
- The five categories cover most public postmortems. Knight Capital 2012 is a deploy bug; AWS S3 2017 is a deploy bug that became a capacity overrun in the control plane; Facebook 2021 is a dependency outage (DNS) that became a cascade; Slack January 2021 is a cascading retry storm; nearly every “API returned wrong data after deploy” report is contract drift. Categorising the next one you read takes seconds, and the category tells you what to look for.
- The patterns map to design controls one-for-one. This is what makes the taxonomy useful in API design (as opposed to taxonomies that are only useful in postmortems). If you cannot list the control your API has for each of the five, you have a gap.
- Postmortem reading is a senior skill. The interviewer who asks “tell me about an API failure” is testing whether you have read postmortems and pattern-matched them. The answer “Knight Capital was a deploy bug, AWS S3 was a deploy bug plus capacity overrun in the metadata subsystem, Facebook was a dependency-of-the-control-plane outage” beats any single deeply-narrated example.
The senior-signal phrasing in an interview: “Most API failures are one of five patterns — deploy bug, capacity overrun, contract drift, dependency outage, cascading retries. The API contract has a specific defence for each one.”
How it works
Pattern 1: Deploy bug
A change rolled out that should not have. The change might be code, configuration, a feature flag, a schema migration, or a control-plane operation. The single-event quality is what makes deploy bugs distinctive: there is a T-0 moment when the deploy happened, and the world shifted.
Sub-shapes:
- The change is correct but the rollout isn’t atomic. Knight Capital 2012 — the new SMARS binary was supposed to land on all eight servers; it landed on seven. The state where some boxes ran new code and one ran old code was the trigger.
- The change is correct but the validation isn’t sufficient. Many “schema migration broke production” stories — the migration worked in staging but staging didn’t have the row that triggered the bug.
- The change is itself incorrect. Knight’s deeper problem was a repurposed feature flag: the new code reused a flag that had once activated functionality retired years earlier but never deleted. On the server that missed the deploy, setting the flag revived that dead code.
- The change is at the control plane. AWS S3 2017: a maintenance command intended to take a few servers offline removed a much larger set because one of its inputs was entered incorrectly.
Canonical case study: Knight Capital 2012. 45 minutes; $440M loss; the company needed emergency financing within days and agreed to be acquired by the end of the year. The technical trigger was a manual eight-server deploy that landed on seven; the structural cause was an API contract that allowed catastrophic outcomes from a single misconfigured caller (no pre-trade limits, no kill-switch, no per-server quorum requirement). See the dedicated postmortem (knight-capital-2012-deployment-bug).
API-design controls for deploy bugs:
- Atomic rollouts. No unintended half-deployed states; a rollout step either completes on every instance it targets or is rolled back, and the system knows which version each instance runs.
- Feature flags with explicit deprecation. No silently-repurposed flags. Dead code is removed, not deactivated-and-left.
- Kill-switches as an API surface. A single endpoint that halts dangerous operations, reachable from a separate auth path so it works when the primary control plane is degraded.
- Canary deployments. A small percentage of traffic hits the new version first; metrics decide whether to proceed.
- Pre-flight validation in the API. Even after a deploy, the API itself enforces invariants (rate limits, position limits, schema validation) that catch a bad release before it can do damage.
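The kill-switch and pre-flight validation controls above can be sketched together. This is a minimal illustration, not a trading-system design: `OrderGate`, `OrderRejected`, and both limits are hypothetical names and values chosen for the example. The point is that the checks run on every request, so they still hold when the code behind them is a bad release.

```python
from dataclasses import dataclass


class OrderRejected(Exception):
    """Raised when a request violates an invariant the API enforces."""


@dataclass
class OrderGate:
    """Hypothetical pre-flight checks that run on every request."""
    max_order_qty: int            # assumed per-order quantity limit
    max_open_notional: float      # assumed cap on total exposure
    halted: bool = False          # kill-switch state
    open_notional: float = 0.0    # exposure accepted so far

    def halt(self) -> None:
        """Kill-switch: reject every new order until resume() is called."""
        self.halted = True

    def resume(self) -> None:
        self.halted = False

    def submit(self, qty: int, price: float) -> float:
        """Validate one order; return the new open notional or raise."""
        if self.halted:
            raise OrderRejected("kill-switch engaged")
        if qty <= 0 or qty > self.max_order_qty:
            raise OrderRejected(f"quantity {qty} outside 1..{self.max_order_qty}")
        notional = qty * price
        if self.open_notional + notional > self.max_open_notional:
            raise OrderRejected("exposure limit exceeded")
        self.open_notional += notional
        return self.open_notional
```

In a real service `halt` would presumably sit behind its own endpoint and credentials, so that it stays reachable when the primary control plane is degraded.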
Pattern 2: Capacity overrun
Demand exceeded what the system was provisioned to handle. The single-event quality is the load spike. The system’s response — degraded latency, dropped requests, autoscaling kicking in late or not at all — defines the blast radius.
Sub-shapes:
- Organic traffic spike. A celebrity tweet, a viral product launch, a press hit: Twitter’s “celebrity-tweet” pattern, a major launch on Twitch, the Hacker News hug of death.
- Cold-start avalanche. A region recovers from an outage; every client retries at once; the recovering region is hit harder than it was before the outage. (This overlaps with Pattern 5.)
- Adversarial load. DDoS, scraping, bot-fueled load tests against your API.
- Demand from your own front-end. A misconfigured retry loop in a mobile client release that ships to a million users.
- Capacity-of-a-bottleneck. A specific dependency (a database, a cache, a third-party API) saturates first; the rest of the system bottlenecks against it.
Canonical case study: Slack January 2021. A scaling event during the high-load traffic surge of January 4: internal traffic, including monitoring, overloaded servers and saturated AWS Transit Gateway capacity. The amplification came from internal services retrying against each other.
API-design controls for capacity:
- Rate limits at every layer. Per-IP at the edge, per-API-key at the gateway, per-user at the application, per-resource at the database. Limits documented in the contract; quota headers in the response.
- Quotas and concurrency caps. Not just request-rate but maximum concurrent operations; payload-size caps; query-complexity caps for GraphQL.
- Backpressure. When at capacity, return `429 Too Many Requests` with a `Retry-After` header rather than queueing the request and timing out.
- Autoscaling that isn’t reactive-only. Predictive scaling for known load patterns (weekly, daily, seasonal); pre-warmed capacity for advertised events.
- Load shedding. When the system is over capacity, shed lower-priority traffic to preserve the critical path. “Reject 10% of traffic immediately” beats “queue everything and time out.”
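As a rough sketch of the rate-limit and backpressure controls above, a token bucket can decide per request whether to serve it or reject it immediately with `429` and a `Retry-After` hint. The names `TokenBucket` and `handle`, the injectable `clock`, and the rate and capacity values are assumptions made for this example, not part of any particular gateway.

```python
import math
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so tests can control time
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self) -> Tuple[bool, float]:
        """Return (allowed, seconds until one token is available)."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        return False, (1.0 - self.tokens) / self.rate


def handle(bucket: TokenBucket) -> Tuple[int, Dict[str, str]]:
    """Reject immediately when over the limit instead of queueing."""
    allowed, wait = bucket.try_acquire()
    if allowed:
        return 200, {}
    # Retry-After is expressed in whole seconds; round up, minimum 1.
    return 429, {"Retry-After": str(max(1, math.ceil(wait)))}
```

One bucket per API key, per user, or per resource gives the layered limits described above; the rejection path costs almost nothing, which is what makes it preferable to queueing under load.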
Pattern 3: Contract drift
Clients and servers diverged. The contract on the server changed; some clients didn’t update; the mismatch produced errors, wrong data, or silent corruption. The single-event quality is often subtle: a v2 of the API was deployed months ago; a client still calls v1 and gets a field it doesn’t recognise; that client started erroring last Tuesday, apparently for unrelated reasons.
Sub-shapes:
- Mobile clients lag behind the server. A 2-year-old mobile app still calls the old schema. The server quietly removed the field; the app crashes on null.
- Webhook payloads add fields. The receiver had strict schema validation on the parser; the new field fails parsing; webhooks all error out.
- Backward-incompatible change shipped as if backward-compatible. A field renamed, a type changed, a default value flipped. The senior version of the bug is “we marked this field optional in the docs, but our client requires it.”
- Auth contract changed. Tokens with a new claim format; old clients can’t parse them; auth failures.
- Error-envelope changed. The error vocabulary shifted; clients that branched on `error.code == "rate_limited"` got `error.kind == "rate_limit_exceeded"` and treated everything as unknown.
API-design controls for contract drift:
- Versioning discipline. URL-based (`/v1/`, `/v2/`) or header-based (`Accept: application/vnd.example.v2+json`). New major version for breaking changes; deprecation timelines published.
- Additive-by-default schemas. Adding optional fields is free; removing or repurposing fields requires a major version. JSON Schema, Protobuf, OpenAPI all enforce variants of this.
- Forward-compatibility for clients. Clients ignore unknown fields rather than rejecting them. Server doesn’t ship new mandatory fields without a version bump.
- Contract tests in CI. Pact-style consumer-driven tests catch a breaking change before it deploys; OpenAPI diff tooling alerts on schema regressions.
- Long deprecation windows. A field marked deprecated in v1 doesn’t disappear in v2; it disappears in v3, with `Sunset` and `Deprecation` headers warning clients in the meantime (RFC 8594, RFC 9745).
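The forward-compatibility control above ("clients ignore unknown fields") is sometimes called a tolerant reader. A minimal sketch, assuming a hypothetical `Invoice` payload with made-up field names: the client keeps the fields it knows, drops the rest, and still fails loudly if a field it requires is missing.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Invoice:
    """Hypothetical client-side view of a server payload."""
    id: str
    amount_cents: int
    memo: Optional[str] = None   # optional: older servers may omit it


def parse_invoice(raw: str) -> Invoice:
    """Tolerant reader: keep the fields we know, ignore everything else."""
    payload = json.loads(raw)
    known = {f.name for f in fields(Invoice)}
    # A missing required field still raises TypeError, which is the
    # breaking change that a contract test should catch before deploy.
    return Invoice(**{k: v for k, v in payload.items() if k in known})


# The server added "currency" without a version bump; the client still parses.
invoice = parse_invoice('{"id": "inv_1", "amount_cents": 500, "currency": "USD"}')
```

A strict parser that rejects unknown keys would turn the same additive change into the webhook failure described in the sub-shapes.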
Pattern 4: Dependency outage
A downstream service you trusted went down. Your API can be perfectly written, perfectly deployed, perfectly sized, and still go down because the database it talks to went down. The single-event quality is the upstream incident; the blast-radius question is “what does your API do when the dependency fails?”
Sub-shapes:
- A cloud-provider service outage. AWS S3, AWS DynamoDB, Cloudflare, Google Cloud — these go down a few times per year; thousands of dependent APIs go down with them.
- A database failure. The primary fails over to a replica; the failover takes 90 seconds; everything that talks to that DB is degraded for those 90 seconds.
- A third-party API outage. Stripe, Twilio, SendGrid, Auth0 — if your API depends on theirs synchronously, you inherit their availability.
- A DNS outage. Less common but spectacular when it happens. Facebook October 2021 — BGP withdrawal of DNS server prefixes; everyone (including Facebook’s own internal tools) lost the ability to resolve facebook.com; recovery was hours.
- A cascading dependency failure. Your API depends on service A; service A depends on service B; B goes down; A goes down; you go down.
Canonical case study: AWS S3 2017. A typo in a maintenance command took offline a much larger subset of the S3 metadata subsystem than intended; thousands of public web services that depended on S3 for static assets went down with it. Full recovery took about four hours. The lesson generalised: the blast radius of a dependency is exactly the set of consumers that have no degraded mode.
API-design controls for dependency outages:
- Circuit breakers. When the downstream is unhealthy, stop calling it for a window. The classic Hystrix / resilience4j pattern (see `circuit-breaker-pattern`).
- Timeouts everywhere. Every outbound call has a timeout shorter than your inbound SLA. A downstream that hangs is worse than one that fails fast.
- Bulkheads. Different downstreams use different thread pools / connection pools; one slow dependency doesn’t exhaust the resources for all the others.
- Degraded modes in the contract. Document what your API does when its dependency is down: return stale data, return `503` with a `Retry-After`, return a partial response. The contract specifies the failure shape.
- Fallback paths. Read-from-cache when the DB is down; serve a static error page when dynamic generation fails; allow read traffic when the write path is unavailable.
- Multi-region failover for hard-availability requirements; multi-vendor where the provider is itself the single point of failure.
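The circuit-breaker and fallback controls above can be sketched in a few lines. This is a simplified illustration rather than the Hystrix or resilience4j implementation: `CircuitBreaker`, `CircuitOpenError`, `get_with_fallback`, the threshold, and the reset window are all assumptions for the example, and a production breaker would typically track failure rates over a sliding window.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling a downstream that is presumed unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock                      # injectable for tests
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("downstream circuit is open")
            # Window elapsed: half-open, let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result


def get_with_fallback(breaker: CircuitBreaker, fetch: Callable[[], T], stale: T) -> T:
    """Degraded mode: serve the stale value when the dependency is unavailable."""
    try:
        return breaker.call(fetch)
    except Exception:
        return stale
```

While the circuit is open the downstream receives no traffic at all from this caller, which is also what gives it room to recover.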
Pattern 5: Cascading retries
A partial failure was amplified by aggressive retry policies until the system collapsed. The single-event quality is the retry storm: every client sees the same blip, every client retries, the retries land all at once, the recovering system is hit with more traffic than the original spike, the system stays down longer than the original incident.
Sub-shapes:
- The recovering-service kill. A service had a 30-second outage; on second 31, a million clients retry simultaneously; the recovering service can’t accept that load; it goes down again for another 5 minutes.
- Synchronous retry without jitter. All retrying clients hit at exactly the same time (`retry after 1 second, then 2 seconds, then 4 seconds`). Without jitter, the spikes are synchronised.
- Retries without budgets. A client retries indefinitely; one client can saturate the downstream by itself.
- Inter-service retries. Service A retries against B; B retries against C; one C failure produces N retries from B and N×M retries from A. The amplification is multiplicative.
- Webhook-storm. A webhook delivery fails; the publisher retries on exponential backoff; thousands of subscribers were affected by the same root cause; the publisher’s retry storm hits each subscriber.
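The multiplicative amplification in the inter-service case is easy to underestimate, so a small worked calculation may help. Counting in attempts (the first try plus its retries) rather than retries alone, the worst-case number of calls that reach the last service is the product of the attempts allowed at each hop. The function name and the example numbers are illustrative only.

```python
from math import prod
from typing import Sequence


def worst_case_calls(attempts_per_hop: Sequence[int]) -> int:
    """Calls reaching the last service for one top-level request.

    attempts_per_hop[i] is the maximum number of attempts (first try plus
    retries) that hop i makes against the next service in the chain.
    """
    return prod(attempts_per_hop)


# A -> B -> C, each hop trying up to 3 times: C can see 9 calls per request.
print(worst_case_calls([3, 3]))      # 9
# One more hop with the same policy triples it again.
print(worst_case_calls([3, 3, 3]))   # 27
```

This is one argument for retrying at a single layer of the stack and letting the other layers fail fast.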
Canonical case studies: numerous Twitter, Facebook, and Slack incidents. The Facebook October 2021 outage compounded a DNS failure (Pattern 4) with retry amplification (Pattern 5): once DNS came back, the recovery itself was slow because every internal client retried at full speed.
API-design controls for cascading retries:
- Exponential backoff with jitter. Mandatory in every retry library. The jitter (full-jitter, equal-jitter, decorrelated-jitter) is what desynchronises retries across clients.
- Retry budgets. A client retries at most `N` times per minute across all requests; once the budget is exhausted, fail fast rather than retry. Google SRE book formalised this.
- Token-bucket retry limits. Token bucket regulates retry rate per client; refill rate governs steady-state retry pressure.
- Server-side `Retry-After`. When the server is overloaded, return `429` or `503` with `Retry-After: 30`. Honest signal; clients that respect it spread out their retries.
- Circuit breakers on the client side too. When a downstream is failing, stop retrying for a window; the client’s circuit breaker protects the server from amplification.
- No-retry on certain errors. `4xx` responses (except `429`) are deterministic; don’t retry. Retrying a `400` is a bug in the retry policy.
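Three of the controls above (full-jitter backoff, a per-client retry budget, and no retries on deterministic errors) fit into one small client-side sketch. Everything named here is an assumption for illustration: `full_jitter_delay`, `RetryBudget`, `call_with_retries`, and the default base, cap, and window values. `send` stands in for whatever function performs the request and returns an HTTP status code.

```python
import random
import time
from collections import deque
from typing import Callable, Deque, Optional


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Delay before retry `attempt` (0-based): uniform in [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


def is_retryable(status: int) -> bool:
    """429 and 5xx may succeed later; other 4xx are treated as deterministic."""
    return status == 429 or 500 <= status <= 599


class RetryBudget:
    """At most `max_retries` retries per `window` seconds across all requests."""

    def __init__(self, max_retries: int, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_retries = max_retries
        self.window = window
        self.clock = clock
        self._spent: Deque[float] = deque()   # timestamps of recent retries

    def try_spend(self) -> bool:
        now = self.clock()
        while self._spent and now - self._spent[0] >= self.window:
            self._spent.popleft()
        if len(self._spent) >= self.max_retries:
            return False                      # budget exhausted: fail fast
        self._spent.append(now)
        return True


def call_with_retries(send: Callable[[], int], budget: RetryBudget,
                      max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> int:
    """Call `send`, retrying only retryable statuses while the budget allows."""
    status = send()
    for attempt in range(max_attempts - 1):
        if not is_retryable(status) or not budget.try_spend():
            break
        sleep(full_jitter_delay(attempt, rng=rng))
        status = send()
    return status
```

The sketch ignores a server-supplied `Retry-After`; a fuller client would presumably wait at least that long before the next attempt.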
The five patterns on one slide
| Pattern | Single-event trigger | Detection signal | API-design control |
|---|---|---|---|
| Deploy bug | A change rolled out | Error rate jumps at deploy time | Atomic rollout, kill-switch, canary |
| Capacity overrun | Load exceeded provision | Latency climbs; saturation alarms | Rate limits, autoscaling, load shedding |
| Contract drift | Server schema changed | Client errors on specific fields | Versioning, additive schemas, contract tests |
| Dependency outage | Downstream went down | Downstream’s error rate; your timeouts | Circuit breaker, bulkheads, degraded mode |
| Cascading retries | A blip + aggressive retries | Recovery is slower than the outage | Backoff + jitter, retry budgets, server-side Retry-After |
When two or three patterns stack
Real outages stack patterns. The compounding is what turns a blip into a multi-hour incident:
- Knight Capital 2012: Pattern 1 (deploy bug) compounded by the absence of Pattern 2 controls (no pre-trade rate limits) and the absence of a kill-switch. With the right limits in place, the deploy bug alone would have done far less damage.
- AWS S3 2017: Pattern 1 (deploy / control-plane command) triggered Pattern 2 (capacity overrun in the S3 metadata subsystem) because more servers were taken offline than intended.
- Facebook October 2021: Pattern 4 (DNS dependency outage) compounded by Pattern 5 (cascading retries during recovery) because every internal tool also depended on the DNS path.
- Slack January 2021: Pattern 5 (cascading retries from internal monitoring) caused Pattern 2 (capacity overrun) on AWS Transit Gateway, which became a Pattern 4 (dependency outage) for everything behind it.
The taxonomy is most useful as a checklist for “what could compound here.” A capacity overrun (Pattern 2) is usually recoverable with rate limits; the same overrun plus a retry storm (Pattern 5) often is not, because the retries keep load above capacity after the original spike has passed.
Variants and trade-offs
The five patterns are stable; the blend differs by API type.
| API type | Most common pattern | Why |
|---|---|---|
| Payments | Deploy bug | High-stakes writes; small changes have huge blast radius. |
| Public web API (CDN, search) | Capacity overrun | Demand is bursty and unbounded. |
| Mobile-backend API | Contract drift | Old clients in the wild for years. |
| Microservice mesh | Dependency outage + cascading retries | Many synchronous dependencies; multiplicative blast radius. |
| Webhook delivery | Cascading retries | Many subscribers, synchronised retry storms. |
This is not prescriptive — every API hits all five patterns at some point — but it shapes the design budget. Payments invests heavily in deploy safety; CDNs invest heavily in autoscaling; mobile backends invest heavily in versioning; meshes invest heavily in circuit breakers.
When this is asked in interviews
“Tell me about a public API failure” is a standard interview prompt. The shallow answer narrates one outage in detail. The senior answer puts that outage on the taxonomy:
- Name the pattern explicitly. “Knight Capital was a deploy bug — Pattern 1 — but compounded by the absence of the rate-limit controls that should have been there.”
- Reference the API-design control. “The lesson for the API is that kill-switches need to be a first-class endpoint.”
- Acknowledge the stacking. “Most real outages are two or three of these stacked; the design defence is having a specific control for each pattern, not relying on one of them as a backstop.”
Specific points to make:
- The five-pattern taxonomy by name. Memorise it; deliver it cleanly.
- For each pattern, the canonical case study, in pattern order: Knight Capital 2012, Slack January 2021, mobile-client drift, AWS S3 2017, Facebook October 2021. Don’t get the year wrong.
- For each pattern, the API-design control that prevents it. Tying the lesson to a contract-level decision is the senior signal.
- The compounding effect. Few real outages are just one pattern; the design strategy is layered controls.
The strongest summary: “Five patterns cover most API failures — deploy bug, capacity overrun, contract drift, dependency outage, cascading retries — and the API contract has a specific control for each.”
Related concepts
- Knight Capital 2012 — The 45-Minute $440M Bug — the canonical deploy-bug postmortem; Pattern 1 with the absence of Pattern 2 controls.
- AWS S3 2017 — The us-east-1 Service Disruption — Pattern 1 triggering Pattern 2 in the control plane; the dependency-outage view.
- Facebook and Uber API Outages — Patterns — Pattern 4 (Facebook DNS) and Pattern 5 (mobility cascades); two real-shaped incidents.
- Managing Retries — the discipline that prevents Pattern 5 from compounding everything else.
- The Circuit Breaker Pattern — the canonical defence against Pattern 4.