Facebook and Uber API Outages

Context#

Two short case studies in one writeup. Different domains, different triggers, same underlying lesson: APIs serving very large client bases have failure modes that look fundamentally different from APIs with a few clients.

The first is Facebook’s six-hour global outage on 4 October 2021 — a configuration change to the backbone network that disconnected Facebook’s nameservers from the public Internet, taking Facebook.com, Instagram, WhatsApp, and Messenger offline simultaneously. The second is a pattern from Uber’s API, where mobile-client retry storms have repeatedly prevented recovering services from stabilising.

Neither is a story about a bug in the application code. Both are stories about what the API contract didn’t say — specifically, about the management plane, about retry discipline, and about the difference between a recoverable failure and an unrecoverable one when there are billions of clients in the wild.

The Facebook event has a public post-incident write-up on Meta’s engineering blog. The Uber pattern is observable across multiple Uber outages over the years and is discussed openly in Uber’s engineering posts about regional failover and load shedding.

What happened#

Facebook, 4 October 2021#

The trigger was routine maintenance. Facebook’s backbone network (the long-haul fibre that connects the company’s data centres) was being assessed for capacity. As part of that work, an engineer issued a command intended to assess the availability of global backbone capacity. Commands like this pass through an audit tool before they run; a bug in that tool let this one through without flagging its actual blast radius.

The command disconnected the backbone. With the backbone gone, Facebook’s data centres could not reach each other.

The cascade unfolded in stages:

Facebook’s authoritative nameservers for facebook.com, instagram.com, whatsapp.net, and fbcdn.net depended on the same backbone. They ran a health check: if a nameserver could not reach the rest of Facebook’s network, it withdrew its BGP route advertisements from the public Internet. The intent was sensible: don’t accept DNS queries you can’t usefully answer. The effect was catastrophic: every nameserver lost the backbone at once, every one of them withdrew, and Facebook’s DNS infrastructure now appeared, to the rest of the Internet, not to exist.
With BGP routes withdrawn, no public DNS resolver on Earth could find Facebook’s nameservers. facebook.com became unresolvable. Every Facebook product simultaneously failed for everyone, everywhere.
The recovery was hampered by the management plane being on the same network. The teams that needed to physically access data centres to repair the backbone configuration found that their internal tools, and reportedly their badge readers, depended on the same network. The out-of-band path that a recovery needs (the way you reach the broken thing) was broken in the same incident.
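
The first stage of the cascade can be sketched as a toy model. This is an illustration only: `Nameserver`, `run_health_check`, and `resolvable` are hypothetical names, and real route withdrawal is performed by routing daemons speaking BGP, not by application code like this.

```python
from dataclasses import dataclass


@dataclass
class Nameserver:
    """Toy model of an authoritative nameserver that announces a route."""
    name: str
    advertising: bool = True

    def run_health_check(self, backbone_reachable: bool) -> None:
        # Naive rule: if the internal network is unreachable, stop
        # advertising so that queries go to a healthier site instead.
        self.advertising = backbone_reachable


def resolvable(fleet: list[Nameserver]) -> bool:
    """The zone resolves only while at least one site still advertises."""
    return any(ns.advertising for ns in fleet)


fleet = [Nameserver(f"ns-{i}") for i in range(4)]

# Local failure: one site loses the backbone, the others keep answering.
fleet[0].run_health_check(backbone_reachable=False)
local_failure_ok = resolvable(fleet)

# Global failure: every site loses the backbone at the same moment.
for ns in fleet:
    ns.run_health_check(backbone_reachable=False)
global_failure_ok = resolvable(fleet)
```

The rule behaves well when one site fails and removes the whole zone when all sites fail together, which is the case the health check was never meant to handle.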

Total outage: approximately six hours. Recovery required engineers to be physically on site at data centres. The follow-on traffic spike when service returned, with billions of clients retrying everything they had been blocked from, caused additional load issues but did not extend the headline outage.

The shape of the failure is what’s instructive. The configuration API and the management plane API were tied to the data plane in their failure modes. The contract didn’t say “this configuration change can disconnect us from the world”; it didn’t require a sanity check before the DNS withdrawal behaviour activated; it didn’t preserve a working out-of-band path.

Uber, the pattern (2016 onward)#

Uber’s published engineering writing describes a pattern that has recurred in slightly different forms across multiple incidents: a service in the request path develops elevated error rates or latency, and the mobile client’s retry behaviour amplifies the problem until the service cannot recover under load.

The pattern in detail:

A backend service — pricing, dispatching, ride-state — begins to fail or slow down. The cause varies: a database hot spot, a deploy with a regression, an upstream dependency outage.
The mobile app retries. The user taps “request a ride,” the request times out or returns an error, the app retries — sometimes silently, sometimes after user re-tap. If the user keeps tapping, the rate multiplies.
The aggregate retry rate across millions of clients far exceeds the normal traffic rate the service was sized for. When the backend service starts recovering, even if it would otherwise be healthy now, it is hit with a retry storm that drops it back over the cliff.
The service oscillates between sick and recovering for an extended period. Each “recovery” lasts seconds before the retry load takes it down again.
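
The amplification arithmetic can be made concrete with a small simulation. This is a deliberately crude sketch, not a model of Uber's systems: the function `simulate` and all its parameters are hypothetical, and it assumes an overloaded service fails every request in that tick, where a real service would usually serve some fraction and oscillate.

```python
def simulate(ticks, capacity, new_per_tick, outage_ticks, max_retries):
    """Return the offered load (requests per tick) under naive client retries.

    Toy model: the service serves nothing during the initial outage and,
    afterwards, fails every request in any tick where offered load exceeds
    capacity. A failed request is retried on the next tick, up to
    `max_retries` times, with no backoff.
    """
    offered_history = []
    # retrying[i] = number of requests about to make retry number i + 1
    retrying = [0] * max_retries
    for tick in range(ticks):
        offered = new_per_tick + sum(retrying)
        offered_history.append(offered)
        down = tick < outage_ticks or offered > capacity
        if down and max_retries > 0:
            # Everything failed: fresh requests start retrying, older ones
            # move to their next retry, and the oldest cohort gives up.
            retrying = [new_per_tick] + retrying[:-1]
        else:
            retrying = [0] * max_retries
    return offered_history


with_retries = simulate(ticks=6, capacity=150, new_per_tick=100,
                        outage_ticks=2, max_retries=3)
without_retries = simulate(ticks=6, capacity=150, new_per_tick=100,
                           outage_ticks=2, max_retries=0)
```

With three immediate retries, offered load climbs to 400 requests per tick, four times the organic rate and well above the capacity of 150, so the two-tick outage never ends in this model. Without retries, load stays at 100 and the service is healthy as soon as the original fault clears.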

This is not a bug in the service. It is a failure mode of the API contract between the mobile app and the backend — specifically, of the retry discipline that the API does not enforce. The contract permits aggressive retry; the client implements it; the result is observable as cascading failure.

Why these are API-design postmortems#

Two different patterns, same family. Each illustrates how the API contract — what it does and doesn’t say, what it does and doesn’t enforce — is the difference between a recoverable failure and a catastrophic one when there are very many clients.

Facebook (management plane coupling)

The failure mode is that the management plane shared infrastructure with the data plane. When the data plane failed, the management plane could not be used to fix it. The API contract did not preserve an out-of-band recovery path.

The lesson: the configuration API must remain reachable even when everything it configures is broken.
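
One way to check this property at design time is to walk the dependency graph of each recovery tool and see whether it reaches the thing it is supposed to repair. The sketch below is hypothetical: the component names and the graph are invented for illustration and do not describe Facebook's actual topology.

```python
def transitive_deps(graph: dict[str, set[str]], start: str) -> set[str]:
    """Everything `start` depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def is_out_of_band(graph: dict[str, set[str]], tool: str, failure_domain: str) -> bool:
    """A tool is out of band only if it never depends on what it must fix."""
    return failure_domain not in transitive_deps(graph, tool)


# Hypothetical dependency graph: component -> components it needs.
graph = {
    "config_console": {"internal_auth"},
    "badge_readers": {"internal_auth"},
    "internal_auth": {"backbone"},
    "oob_console": {"cellular_modem"},
}

console_ok = is_out_of_band(graph, "config_console", "backbone")
oob_ok = is_out_of_band(graph, "oob_console", "backbone")
```

The indirect dependency is the one that gets missed: `config_console` never names the backbone, but its authentication path does.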

Uber (client retry amplification)

The failure mode is that the API contract permits aggressive retry from many clients, and the aggregate of those retries can prevent recovery. The contract did not require backoff, and it did not oblige clients to honour Retry-After.

The lesson: the server must specify, and the client SDK must enforce, exponential backoff with jitter.
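
A minimal sketch of the client half, assuming the "full jitter" variant of exponential backoff. The function name `next_delay`, its defaults, and the choice to add jitter on top of a server hint are illustrative assumptions, not a description of any particular SDK.

```python
import random


def next_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Uses full jitter: a uniform draw between 0 and an exponentially
    growing ceiling. A server-supplied Retry-After value, if present, is
    treated as a floor, with the jitter added on top so that clients given
    the same hint do not all return in the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    jitter = rng.uniform(0.0, ceiling)
    if retry_after is not None:
        return retry_after + jitter
    return jitter
```

The jitter matters as much as the exponent: without it, a million clients that failed in the same second all come back in the same later second.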

Both are about what the contract didn’t say. Both are invisible until scale exposes them.

The fix (technical and institutional)#

Facebook’s remediation#

Meta’s published post-event remediation, paraphrased:

Configuration validation strengthened. The tooling that issues backbone changes now requires explicit confirmation when a command exceeds a blast-radius threshold; the validation tool that had the bug was fixed and double-verified.
DNS withdrawal logic decoupled from internal reachability. The behaviour where nameservers withdrew BGP routes on losing internal connectivity was reviewed; the trigger conditions tightened so transient internal issues don’t immediately propagate outward.
Out-of-band recovery paths. Physical-access systems and recovery tooling were given independent paths that don’t depend on the backbone being healthy. The “we can’t get into the data centre to fix it” failure mode was addressed at the procedural and physical level.
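
The first item, a blast-radius gate, can be sketched as a validation step that refuses large changes without explicit sign-off. This is not Meta's tooling: `ChangeRequest`, `validate_change`, the 10% threshold, and the two-approver rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


class ChangeRejected(Exception):
    """Raised when a change needs more sign-off than it has."""


@dataclass(frozen=True)
class ChangeRequest:
    description: str
    links_affected: int
    links_total: int
    approvers: frozenset = frozenset()


def validate_change(change, max_unreviewed_fraction=0.10, required_approvers=2):
    """Return the blast radius as a fraction, or raise ChangeRejected."""
    if change.links_total <= 0:
        raise ValueError("links_total must be positive")
    fraction = change.links_affected / change.links_total
    too_big = fraction > max_unreviewed_fraction
    if too_big and len(change.approvers) < required_approvers:
        raise ChangeRejected(
            f"{change.description!r} affects {fraction:.0%} of links; "
            f"needs {required_approvers} approvers, has {len(change.approvers)}"
        )
    return fraction
```

The gate is only as good as its estimate of `links_affected`; in the 2021 incident it was the estimating tool itself that had the bug, which is why the validation path needs its own tests.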

Uber’s remediation pattern (across incidents)#

Uber’s engineering writing describes a layered response that has been refined over years:

Client SDK retry discipline. The mobile SDK was changed to implement exponential backoff with jitter, honour Retry-After headers, and cap maximum retry frequency. Errors are surfaced to the user with an appropriate retry budget rather than silently amplified.
Server-side circuit breakers and load shedding. Backend services trip a circuit breaker when error rates exceed a threshold; once tripped, the breaker returns fast errors with Retry-After: <seconds> headers that the SDK respects. Recovery happens because incoming load is shed automatically, not because the service heroically out-paces it.
Regional failover. Stateless services run in multiple regions; failover routes traffic away from a sick region. The fast-fail with Retry-After is what makes failover work — clients honour it and stop hitting the broken region within a window.
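
The server-side breaker can be sketched as follows. This is a simplified illustration rather than Uber's implementation: it trips on consecutive failures instead of an error rate, and the class name, thresholds, and injectable `clock` are assumptions.

```python
import math
import time


class CircuitBreaker:
    """Trips after repeated failures and fast-fails with a Retry-After hint."""

    def __init__(self, failure_threshold=5, cooldown_seconds=10.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed

    def record(self, success: bool) -> None:
        """Report the outcome of a request that was allowed through."""
        if success:
            self.consecutive_failures = 0
            self.opened_at = None
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Open (or re-open) the breaker and restart the cooldown.
            self.opened_at = self.clock()

    def check(self):
        """Return (status, headers) to fast-fail with, or None to proceed."""
        if self.opened_at is None:
            return None
        remaining = self.cooldown_seconds - (self.clock() - self.opened_at)
        if remaining <= 0:
            return None  # cooldown over: let requests through as probes
        return 503, {"Retry-After": str(math.ceil(remaining))}
```

A handler would call `check()` before doing any work and `record()` afterwards. A production breaker would also limit how many probe requests pass once the cooldown ends; this sketch lets them all through until one outcome is recorded.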

The wider industry response#

These two incident families — and similar ones at other large-fleet APIs — informed several de facto standards:

Retry-After on every retriable error. Modern HTTP API designs treat the Retry-After header as a hard requirement on 429, 503, and circuit-breaker-tripped responses. A bare 503 without a hint is now considered a contract bug.
Backoff with jitter as a standard SDK behaviour. AWS’s SDKs and Google’s client libraries ship exponential backoff with jitter for retriable errors, and most modern HTTP libraries offer it as a configurable option. Hand-written loops such as “for retry in range(3): time.sleep(1)” are a code-review smell.
Out-of-band management surfaces. Major cloud providers treat “control plane resilience independent of data plane” as a hard architectural rule. Status pages, configuration consoles, IAM systems run on infrastructure decoupled from the products they manage.
Configuration changes require sanity gates. The combination of audit logging, two-person rules for high-blast-radius changes, automated validation of “this change reduces capacity by X% — confirm?” is now table stakes for any control-plane API.
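
On the client side, honouring Retry-After starts with parsing it. The HTTP specification allows the header to carry either a number of seconds or an HTTP date, so a client needs to handle both. The sketch below uses only the Python standard library; the function name `parse_retry_after` and the decision to return None for unparseable values are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the number of seconds to wait, or None if unparseable."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative wait.
    return max(0.0, (when - now).total_seconds())
```

Returning None rather than zero for a malformed header lets the caller fall back to its own backoff schedule instead of retrying immediately.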

Lessons for API designers#

The shared lesson is large-fleet failure dynamics. Specifically:

The configuration API is a first-class API. Its failure modes must not depend on the data plane recovering. If you can’t change configuration when the system is down, you can’t fix it. Out-of-band reachability is not a nice-to-have; it is the recovery contract.
Retry discipline is a server responsibility, not just a client one. Servers must specify retry behaviour (Retry-After, hints about budget) and circuit-break under load. Clients must honour those hints. If the contract leaves retry to client discretion, large fleets will choose the wrong default and the recovery curve will be a sawtooth.
Health checks need to ask “should I be reachable” carefully. Facebook’s nameservers withdrew BGP routes because they couldn’t reach internal systems. That heuristic was correct for most failure modes and disastrous for one. Health checks that change external visibility are a control-plane decision; their conditions should be reviewed at the contract level, not just inside the service code.
Scale changes the contract. An API with 100 clients can be sloppy about retry guidance and recover anyway. An API with 100 million mobile clients cannot. The same contract is fine in one regime and lethal in the other. Design for the scale you will reach, not the scale you have today.
Out-of-band always means independent of the failure domain. Status pages on the service they monitor, recovery tools on the network they fix, on-call paging through the cloud provider you depend on — all anti-patterns. The dependency direction has to point outward.
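
The health-check lesson can be sketched as a guarded withdrawal rule: a site removes itself only when the failure looks local. This is one possible design, not Meta's published fix. The function `should_withdraw` and its threshold are hypothetical, and it assumes each site can learn how many peers are healthy through some channel that does not share the failing network.

```python
def should_withdraw(local_ok, healthy_peers, total_peers,
                    min_healthy_fraction=0.5):
    """Decide whether this site should stop advertising itself.

    Withdraw only when this site is unhealthy AND enough peers are
    confirmed healthy to absorb its traffic. If peers cannot be confirmed
    healthy, stay advertised: a degraded answer beats no answer at all.
    """
    if local_ok or total_peers == 0:
        return False
    return healthy_peers / total_peers >= min_healthy_fraction
```

Under this rule a single sick site still steps aside, but a fleet-wide failure leaves every site advertised, so the zone stays resolvable while the underlying fault is repaired.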

When this is asked in interviews#

Interviewers who ask you to “name a public API failure” often have Facebook 2021 or the Uber retry pattern in mind, because the two hit different parts of the contract surface. The senior-signal answers:

For Facebook 2021:

Identify the trigger (a backbone configuration command that led the nameservers to withdraw their BGP routes) in one sentence.
Identify the structural failure (management plane shared a failure domain with data plane; the recovery path itself was broken) in one sentence.
Name the lesson: out-of-band control plane, independent of the system being controlled.
Connect to a system the interviewer just walked through — “in this design, the admin console runs on the same load balancer as the user-facing API; that’s a Facebook-2021 pattern; here’s how I’d separate them.”

For the Uber retry pattern:

Identify the trigger (client-side aggressive retry, no server-side hint, recovery storm).
Identify the structural failure (retry discipline left to client default; no Retry-After; no circuit breaker).
Name the lesson: server specifies, SDK enforces, backoff with jitter, honour Retry-After.
Connect: “in this design, the mobile app retries on 5xx without backoff; that’s an Uber-pattern bug; here’s the contract I’d add.”

The junior version of either answer stops at “they should have tested more.” Both incidents had testing. The interviewer is checking whether you can see the contract-level failure, not the operational one.

What Causes API Failures — A Taxonomy — the failure taxonomy these two postmortems sit inside; control-plane coupling and retry amplification are two of the recurring patterns.
Knight Capital 2012 — The 45-Minute $440M Bug — another cascade where the contract didn’t enforce safety, this time in trading.
AWS S3 2017 — The us-east-1 Service Disruption — the third of the canonical big-three cascades; status-page dependency mirrors Facebook’s management-plane coupling exactly.
The Circuit Breaker Pattern — the server-side half of the retry-storm fix.
Managing Retries — the client-side half; the two have to be designed together.