AWS S3 2017 — The us-east-1 Service Disruption — API Design

Context

On 28 February 2017, between roughly 09:37 and 13:54 Pacific Time, Amazon S3 in the us-east-1 region — the oldest and busiest AWS region, the default for an enormous fraction of customer workloads — went down. For approximately four hours, GET, PUT, LIST, and DELETE requests against S3 objects in that region returned errors or timed out.

The blast radius reached far beyond AWS’s direct customers. Trello, Quora, Slack, Coursera, Imgur, Strava, GitHub status pages, parts of Medium, parts of Business Insider, parts of Salesforce — anything whose architecture assumed S3 in us-east-1 was always there — went down or degraded for hours. The episode became the canonical “the cloud is the new single point of failure” story.

The trigger was banal. The cascade was anything but. Amazon published a public post-event summary the same week (the “Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region”), and it remains one of the most-cited postmortems in distributed-systems literature.

This writeup focuses on the API-design lessons. The “what went wrong with the operational tooling” thread is rich, but the contract-level mistakes are what generalise.

What happened

The trigger

The S3 team was debugging an issue that was causing the S3 billing system to progress more slowly than expected. An authorised engineer, following an established playbook, issued a command intended to remove a small number of servers from one of the S3 subsystems used by the billing process.

A typo in the command’s input parameter caused a much larger set of servers to be removed than intended. The removed set spanned two critical S3 subsystems in us-east-1:

The index subsystem — which manages metadata for all S3 objects in the region. GET, LIST, PUT, and DELETE all depend on it. It is the source of truth for “what objects exist, where do their bytes live, what’s their state.”
The placement subsystem — which manages allocation of storage for new objects. PUT in particular cannot make forward progress without it.

With enough capacity removed from both, neither subsystem could serve requests. GET, PUT, LIST, DELETE all failed for any S3 object in us-east-1. New uploads couldn’t happen. Existing objects couldn’t be read. Lifecycle policies stalled.

Why recovery took four hours

The S3 team identified the cause within minutes. The fix — restart the affected subsystems — should have been straightforward. It wasn’t. Two compounding factors:

The index subsystem had not been fully restarted in many years. S3 in us-east-1 had grown by orders of magnitude since the last cold start. The subsystem performs safety checks on its data structures at startup. With the much larger present-day data set, those checks took far longer than they had been designed for. The cold-start time was effectively untested at current scale.
The placement subsystem depends on the index subsystem being healthy. Until the index was back, placement could not be brought up. Even once both were running, S3 throttled incoming requests while replaying its work queue.
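The two factors above (a dependency-ordered restart, and startup checks that grow with the data set) can be sketched in a few lines. This is a minimal illustration, not AWS's tooling: the dependency map, the linear cost model, and every number in it are hypothetical, and it assumes the subsystems restart strictly one after another.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each subsystem maps to what must be healthy first.
DEPENDS_ON = {"index": set(), "placement": {"index"}}

# Hypothetical cost model: startup safety checks scale linearly with data size.
CHECK_SECONDS_PER_BILLION_OBJECTS = {"index": 40.0, "placement": 10.0}


def restart_order(depends_on):
    """Return subsystems in an order that respects their dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def estimated_recovery_seconds(depends_on, objects_in_billions):
    """Sum per-subsystem check times, assuming strictly serial restarts."""
    return sum(
        CHECK_SECONDS_PER_BILLION_OBJECTS[name] * objects_in_billions
        for name in restart_order(depends_on)
    )
```

Under this toy model, a hundredfold growth in stored objects means a hundredfold longer cold start. An estimate last validated at the old scale says nothing useful about the present one.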

The result: a control-plane failure that, in principle, was “just restart it” took roughly four hours to resolve.

The cascade: the part that matters for API designers

The post-event summary also disclosed a secondary failure that, in many ways, is the more memorable detail:

The AWS Service Health Dashboard, the status page customers consult to find out whether AWS is down, depended on S3 in us-east-1: the administration console used to update it could not work without S3.

When S3 went down, so did the dashboard’s ability to update. From the start of the event until 11:37 PST, roughly two hours, AWS could not update individual services’ status on the dashboard and fell back on its Twitter feed and banner text on the dashboard itself. Customers experiencing the outage went to the status page; the status page either failed to load or showed green; they assumed their own infrastructure was at fault and chased ghosts.
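The dependency-direction problem can be checked mechanically. The sketch below assumes you keep a service dependency map (the service names here are hypothetical) and walks it to see whether a status surface can reach the service it reports on.

```python
def transitive_dependencies(service, depends_on):
    """Collect everything `service` needs, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def status_surface_is_out_of_band(status_surface, monitored, depends_on):
    """True only if the status surface never reaches the monitored service."""
    return monitored not in transitive_dependencies(status_surface, depends_on)


# Hypothetical graphs: the first hides the monitored store behind one hop.
COUPLED = {
    "status-page": ["status-admin-console"],
    "status-admin-console": ["object-store"],
}
INDEPENDENT = {
    "status-page": ["status-admin-console"],
    "status-admin-console": ["external-host"],
}
```

A check like this could run in CI against the declared dependency map. It only catches dependencies that are declared, so it complements an actual failover exercise and does not replace one.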

For external systems, the cascade pattern repeated at every layer:

Apps using S3 directly for object storage: dead.
Apps using S3 for static asset hosting (images, CSS, JS): partial page loads, broken UI.
Apps using a CDN backed by S3: content still within its TTL was served from cache, cache misses failed, and the experience degraded slowly as cached entries expired.
Apps using a service that uses S3 (Elastic Beanstalk, CloudWatch logs, SES configuration, even some EC2 launches): mysterious downstream failures that the customer couldn’t trace to a single root cause until AWS published.
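The CDN case is worth seeing in miniature, because it explains why some sites degraded gradually instead of failing at once. The classes below are a toy model, not any real CDN's behaviour: `Origin` and `TtlEdgeCache` are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
class Origin:
    """Stand-in for an object store that can be switched off."""

    def __init__(self):
        self.healthy = True

    def fetch(self, key):
        if not self.healthy:
            raise ConnectionError(f"origin unavailable for {key}")
        return f"bytes of {key}"


class TtlEdgeCache:
    """Serves cached bodies until they expire, then goes back to the origin."""

    def __init__(self, ttl_seconds, origin_fetch):
        self.ttl_seconds = ttl_seconds
        self.origin_fetch = origin_fetch
        self._entries = {}  # key -> (body, expires_at)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # fresh hit: the origin is never contacted
        body = self.origin_fetch(key)  # miss or expired: fails if origin is down
        self._entries[key] = (body, now + self.ttl_seconds)
        return body
```

With a 300-second TTL, an object cached at t=0 keeps being served at t=100 even if the origin has failed, and starts failing at t=400. Anything never cached fails immediately.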

Why this is an API-design postmortem

The instinct on reading the S3 story is “this was an ops mistake — they ran a bad command.” That is correct but radically incomplete. The deeper failures are contract-level:

The maintenance API of the billing system did not require typed-out confirmation for a destructive parameter. A single typo expanded the blast radius from “small subset” to “two critical subsystems.” A well-designed admin API would have required the operator to type the count of affected servers and confirm if it exceeded a safety threshold.
The data-plane (GET/PUT) API and the control-plane (index/placement) API were not isolated by blast radius. A failure in the control plane took down the data plane immediately, instead of letting the data plane serve cached results until the control plane recovered.
The status page violated the dependency direction. A status page exists precisely to communicate when the system it monitors is down. If the status page depends on that system, the dependency points the wrong way. Out-of-band hosting is an API-level requirement of any status surface.
Cold-start time was an unstated part of the API’s recovery contract. If your customers’ RTO depends on your service coming back within an hour, your cold-start budget is an hour. If you haven’t measured it lately, you don’t know what you’ve promised.
There was no graceful-degradation path on the read API. GET could have served cached metadata for popular objects while the index subsystem rebooted. Instead it returned hard errors. Read availability and write availability could have been decoupled — they weren’t.

Every one of those is an API-design decision. The cold-start surprise and the typo are operational failures; the shape of the contract that turned them into a four-hour, half-the-internet outage is a design one.
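The first point, typed confirmation of the affected count, is cheap to build. The sketch below is an assumption-heavy illustration: AWS's summary does not describe the real command or its parameters, so the prefix selector, the host names, and the threshold are all hypothetical.

```python
class ConfirmationRequired(Exception):
    """Raised when a large removal lacks a matching typed count."""


def plan_removal(fleet, host_prefix, typed_count=None, confirm_threshold=3):
    """Resolve the selector, then demand a retyped count for large matches."""
    targets = sorted(host for host in fleet if host.startswith(host_prefix))
    if len(targets) > confirm_threshold and typed_count != len(targets):
        raise ConfirmationRequired(
            f"{len(targets)} servers match {host_prefix!r}; "
            "retype that count to proceed"
        )
    return targets


# Hypothetical fleet: two billing hosts plus index and placement capacity.
FLEET = (
    ["s3-bill-01", "s3-bill-02"]
    + [f"s3-index-{n:02d}" for n in range(1, 5)]
    + [f"s3-place-{n:02d}" for n in range(1, 4)]
)
```

Here the intended selector `"s3-bill-"` matches two hosts and passes without ceremony. A truncated `"s3-"` matches nine and is rejected unless the operator also types `9`, which forces a look at the real blast radius before anything is removed.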

The fix (technical and institutional)

What AWS did

AWS’s published remediation, paraphrased:

Maintenance tooling was modified to remove capacity more slowly and to block any removal that would take a subsystem below its minimum required capacity level. AWS also committed to auditing its other operational tools for similar safety checks.
Faster recovery of key S3 subsystems: further partitioning of the index subsystem into smaller cells, already planned for later in 2017, was reprioritised to begin immediately. Smaller cells bound the blast radius of any single failure and shorten the restart that recovery requires.
The Service Health Dashboard was moved to a multi-region architecture that no longer depends on us-east-1 S3 specifically. Status communication now survives the failure of any single region’s storage service.
Periodic exercises of cold-start paths. Subsystems that haven’t been cold-started in production are now exercised in staging at current scale, so cold-start time is a known, monitored number rather than a surprise.
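The first remediation, a tool that refuses to take a subsystem below a safe floor, might look roughly like this. It is a sketch under stated assumptions, not AWS's implementation: the `Subsystem` fields, the exception name, and the numbers are hypothetical. AWS's summary also mentions removing capacity more slowly, and the per-step cap is one hypothetical way to express that.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor below which requests cannot be served


class RemovalRejected(Exception):
    """Raised when a removal would be too large or would breach the floor."""


def remove_capacity(subsystem, count, max_per_step=2):
    """Remove `count` servers only if the request is small and stays above the floor."""
    if count <= 0:
        raise ValueError("count must be positive")
    if count > max_per_step:
        raise RemovalRejected(f"at most {max_per_step} servers per step")
    if subsystem.active_servers - count < subsystem.min_required:
        raise RemovalRejected(
            f"{subsystem.name} would drop below {subsystem.min_required} servers"
        )
    subsystem.active_servers -= count
    return subsystem.active_servers
```

The floor check lives inside the tool rather than in the playbook, so a mistyped argument produces a rejection instead of an outage.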

The wider industry response

S3 in 2017 had the same effect on cloud architecture that Knight Capital had on trading deploys five years earlier:

“Multi-region by default” moved from advanced-architecture pattern to baseline-prudent for any service with a meaningful uptime SLO. AWS itself rolled out additional regional independence for S3 features in subsequent years.
“Status pages out of band” became a hard rule. Atlassian’s Statuspage product (and competitors) won enterprise mindshare in part because their selling point was “we host it, so when your stuff is down, your status page still works.”
“Cell-based architecture” as a discipline for limiting blast radius gained widespread adoption. The S3 redesign mirrored what Amazon DynamoDB and AWS Lambda had already been doing internally; outside AWS, Werner Vogels’s writing on cell-based architecture became reference material.
“Test the cold start” entered the operational vocabulary. Chaos engineering tools (Chaos Monkey and its successors) added cold-restart exercises to their scenario sets.
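To make the cell idea concrete, here is a minimal sketch of how partitioning bounds blast radius. Routing keys to cells by hash-and-modulo is an assumption made for brevity, not a description of how S3 or any AWS service assigns cells, and the function names are hypothetical.

```python
import hashlib


def cell_for(key, cell_count):
    """Map a key to a cell deterministically using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def affected_fraction(keys, cell_count, failed_cells):
    """Share of keys whose cell is currently failed."""
    hit = sum(1 for key in keys if cell_for(key, cell_count) in failed_cells)
    return hit / len(keys)
```

With 16 cells and an even key distribution, losing one cell affects about one sixteenth of keys instead of all of them, and recovery means restarting one cell's worth of state.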

Lessons for API designers

The S3 disruption generalises beyond AWS. Five lessons follow that any API designer should carry; the API-design walk-through refers back to them when planning operational concerns:

Destructive admin APIs need confirmation gates. Any maintenance endpoint whose effect can exceed a safe threshold should require typed confirmation, a magic-word parameter, or a two-person rule. The cost is seconds of operator time; the benefit is converting a typo from “outage” into “command rejected.”
Read availability and write availability are different SLOs and should be designed as such. A read-only degraded mode that serves last-known-good metadata is almost always better than a hard error. Decouple the contracts.
Status pages, monitoring, and incident communication must be out-of-band. Anything that says “this service is down” must run on infrastructure independent of “this service.” The dependency direction matters; this is not optional.
Cold-start time is part of your API’s contract. If you advertise an RTO of an hour, your cold start must fit inside that hour, be exercised regularly, and have its latest test result on a dashboard. An untested cold start is a guess.
Limit blast radius at the API layer. Cells, regions, AZs, shards — whichever boundary fits. A single typo, a single deploy, a single misconfigured client should be unable to take out the whole service. If it can, the boundary is the API designer’s failure, not the operator’s.
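The second lesson, a read path that degrades instead of failing, can be sketched as a wrapper that remembers the last metadata it saw. This illustrates the pattern only: `MetadataReader`, `IndexUnavailable`, and the response shape are hypothetical, and a real system would also need to bound the cache and decide how stale is too stale.

```python
class IndexUnavailable(Exception):
    """Raised by the lookup function when the metadata index is down."""


class MetadataReader:
    """Serves live metadata when possible, last-known-good metadata otherwise."""

    def __init__(self, index_lookup):
        self._index_lookup = index_lookup
        self._last_known_good = {}

    def get(self, key):
        try:
            metadata = self._index_lookup(key)
        except IndexUnavailable:
            if key in self._last_known_good:
                # Degraded mode: answer from memory and say so explicitly.
                return {"metadata": self._last_known_good[key], "stale": True}
            raise  # nothing cached for this key, so the error is honest
        self._last_known_good[key] = metadata
        return {"metadata": metadata, "stale": False}
```

The explicit `stale` flag is the contract-level part: callers can choose whether stale data is acceptable, which is a choice a hard error never gives them.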

When this is asked in interviews

S3 2017 is the canonical “tell me about a cascading cloud failure” question in any infrastructure, platform, or storage-API interview. The senior-signal version of the answer:

Identify the technical trigger (typo in maintenance command, removed too many servers) in one sentence.
Identify the structural failure (control plane took down data plane; status page depended on the service it was meant to monitor) in one sentence.
Name three of the five lessons (confirmation gates, read/write decoupling, out-of-band status, cold-start budget, blast radius).
Apply one of those lessons concretely to the system the interviewer just had you design.

The junior version stops at “they should have type-checked the parameter.” That is true, but it is only one of the five lessons. The interviewer is checking whether you see the API contract, not just the bug.

What Causes API Failures — A Taxonomy — the failure taxonomy this postmortem sits inside (cascading control-plane failures are one of the recurring patterns).
Knight Capital 2012 — The 45-Minute $440M Bug — a different cascade: dormant code revived by a flag, no kill-switch at the API layer.
Facebook and Uber API Outages — Patterns — two more patterns in the cascading-API family.
The Circuit Breaker Pattern — the building block that limits the blast radius of an upstream failure inside a single service.
Managing Retries — the client-side discipline that decides whether a recovering service stabilises or gets retried back into the floor.