AWS us-east-1 2017 — The S3 Outage — Computer Networks

What happened

On 28 February 2017, between approximately 17:37 and 21:54 UTC, Amazon S3 in the us-east-1 region experienced a significant service disruption. The S3 API returned high error rates for much of that window, which lasted a little over four hours. A long list of dependent AWS services (including parts of EC2, EBS, Lambda, and the AWS Service Health Dashboard itself), along with countless customer applications, were degraded or unavailable for some or all of it.

The trigger was a single mistyped input to a maintenance command. An AWS engineer debugging the S3 billing system intended to remove a small number of servers, but one input was entered incorrectly and the command removed a much larger set, including capacity from two foundational S3 subsystems. Those subsystems had to be restarted, and the restart took longer than expected because they had not been fully restarted at this scale in years. The AWS postmortem published at aws.amazon.com/message/41926/ is the canonical source.

Context

S3 in 2017 was already over a decade old and an internal foundation for much of the AWS service catalogue. EBS snapshots lived in S3. Lambda function code packages lived in S3. The Elastic Container Registry stored images in S3. CloudFront’s edge cache fell back to S3 origins. Glacier used S3 internally. Even the AWS Service Health Dashboard — the page customers checked to see whether AWS was up — used S3 to host its static assets.

S3 itself was structured as a small number of large internal subsystems, each handling one aspect of the service. Two relevant ones for this event:

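The index subsystem managed the metadata for every object: which bucket, which key, where the data lived, who could access it. It was required for every GET, PUT, LIST, and DELETE request.
The placement subsystem managed allocation of storage for new object writes. It was used on PUT requests and, per the postmortem, needed a functioning index subsystem to operate correctly.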

Both ran on dedicated fleets that had been incrementally scaled up over years. They had not been fully restarted (cold-started from zero) at their current size for a long time. The startup-time properties at full scale were not well understood.

Trigger and propagation

At 17:37 UTC, an authorised AWS engineer was debugging an issue with the S3 billing system, which charges customers for storage and requests. The engineer ran a playbook command intended to remove a small number of servers from one of the billing subsystems. The command’s input was specified incorrectly: instead of removing a few servers, it removed a much larger set — including servers in the index and placement subsystems.

Capacity in those subsystems dropped below the threshold required to serve customer requests. The index subsystem could no longer answer “where is this object?” for some fraction of requests; S3 API calls began returning 503 Service Unavailable. As the percentage of failing requests grew, dependent services started seeing failures of their own.

17:37  Bad command executed; capacity removed from index + placement subsystems
17:37-17:45  S3 API error rates spike; dependent services begin failing
17:45-18:30  AWS engineers diagnose; decide to fully restart the affected subsystems
18:30-21:00  Restart in progress
              The subsystems require a longer cold-start time than expected
              Engineers cannot use the AWS Service Health Dashboard to communicate
              because the Dashboard itself depends on S3 for asset hosting
21:00-21:54  S3 API errors return to baseline as subsystems complete startup
21:54  Full recovery announced

The cascade into dependent services was the second-order failure. EBS volume operations that needed to fetch snapshot metadata from S3 stalled. New Lambda functions could not start because their code packages were unreachable. EC2 instance launches that involved fetching AMIs or snapshot data hung. CloudFront’s S3-origin requests failed: cached assets continued to be served, but any cache miss went to S3 and failed.

The AWS Service Health Dashboard, the page customers consult during outages, depended on S3 in us-east-1. As S3 degraded, AWS could not update individual service statuses on the Dashboard. According to the postmortem, AWS communicated through its Twitter feed and banner text on the Dashboard during the worst of the event.

Detection and response

Detection was immediate. AWS’s internal monitoring saw the capacity drop within seconds. The challenge was that recovery required cold-starting subsystems that hadn’t been cold-started at this scale before.

When the index subsystem was instructed to restart, it had to perform an internal consistency check that scaled with the total number of objects in the system. The check time, at S3’s 2017 scale, was hours rather than the minutes it had been when the procedure was originally designed. Engineers had to wait for the subsystem to come back through its natural startup path; there was no shortcut.
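A back-of-the-envelope model makes the scaling argument concrete. The sketch below assumes the check visits every object once at a fixed fleet-wide rate; the function name, the rate, and the object counts are all hypothetical and are not figures from AWS. They are chosen only to show how a procedure designed for minutes can drift into hours.

```python
def cold_start_hours(object_count: int, objects_checked_per_second: float) -> float:
    """Estimate the duration of a consistency check that visits every object once."""
    if objects_checked_per_second <= 0:
        raise ValueError("throughput must be positive")
    return object_count / objects_checked_per_second / 3600


# Hypothetical numbers, not AWS figures: they only illustrate linear growth.
RATE = 50_000_000  # objects validated per second across the fleet (assumed)

at_design_time = cold_start_hours(10_000_000_000, RATE)     # 10 billion objects
at_later_scale = cold_start_hours(1_000_000_000_000, RATE)  # 1 trillion objects

print(f"at design time: {at_design_time * 60:.1f} minutes")  # 3.3 minutes
print(f"at later scale: {at_later_scale:.1f} hours")         # 5.6 hours
```

If throughput stays flat while the object count grows 100x, the check time grows 100x as well, which is the kind of drift nobody sees unless the restart is actually exercised.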

Communication during the event was hampered by the Dashboard’s S3 dependency. AWS fell back to its Twitter feed and banner text on the Dashboard. In the postmortem it reported changing the Dashboard’s administration console to run across multiple AWS regions, so that future events would not blind customers in the same way.

Root cause

Three things had to be true at once:

A maintenance command accepted arguments without bounds-checking. The playbook allowed an engineer to specify an arbitrary number of servers to remove. There was no upper bound, no “are you sure you want to remove 5,000 of these?” prompt, no dry-run-first requirement.
The blast radius of a single command included multiple subsystems. The command operated across subsystems via a shared abstraction. Removing capacity from billing accidentally removed capacity from index and placement because the tool didn’t enforce subsystem isolation.
The subsystems had not been cold-started at scale in a long time. Their startup-time characteristics had grown without anyone noticing. Recovery time was therefore much longer than expected.

The bad command was the trigger. The architectural causes were the missing safety rails in the operations tooling and the unmaintained cold-start path.
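The missing rails can be illustrated with a small validation step that runs before any capacity is touched. This is a minimal sketch, not AWS's tooling: `Server`, `plan_removal`, `UnsafeRemoval`, the `max_per_command` limit, and the capacity floor are all hypothetical names and values. It checks three things: the command stays inside one subsystem, the batch size is bounded, and the remaining capacity stays at or above a floor.

```python
from dataclasses import dataclass


class UnsafeRemoval(Exception):
    """Raised when a removal request violates a safety rail."""


@dataclass(frozen=True)
class Server:
    name: str
    subsystem: str


def plan_removal(fleet, targets, subsystem, min_capacity, max_per_command=5):
    """Validate a removal request and return the servers that would be removed.

    Nothing is removed here. The caller reviews the returned plan (a dry run)
    and applies it in a separate step.
    """
    targets = list(dict.fromkeys(targets))  # drop duplicates, keep order
    by_name = {s.name: s for s in fleet}

    unknown = [t for t in targets if t not in by_name]
    if unknown:
        raise UnsafeRemoval(f"unknown servers: {unknown}")
    selected = [by_name[t] for t in targets]

    # Rail 1: one command operates on one subsystem only.
    stray = [s.name for s in selected if s.subsystem != subsystem]
    if stray:
        raise UnsafeRemoval(f"servers outside {subsystem!r}: {stray}")

    # Rail 2: bounded batch size per invocation.
    if len(selected) > max_per_command:
        raise UnsafeRemoval(
            f"{len(selected)} servers requested, limit is {max_per_command}"
        )

    # Rail 3: remaining capacity must stay at or above the floor.
    current = sum(1 for s in fleet if s.subsystem == subsystem)
    remaining = current - len(selected)
    if remaining < min_capacity:
        raise UnsafeRemoval(
            f"{subsystem!r} would drop to {remaining}, minimum is {min_capacity}"
        )
    return selected


# Hypothetical fleet: ten billing servers and ten index servers.
fleet = [Server(f"billing-{i}", "billing") for i in range(10)]
fleet += [Server(f"index-{i}", "index") for i in range(10)]

plan = plan_removal(fleet, ["billing-0", "billing-1"], "billing", min_capacity=6)
print([s.name for s in plan])

try:
    plan_removal(fleet, ["billing-0", "index-3"], "billing", min_capacity=6)
except UnsafeRemoval as err:
    print("refused:", err)
```

Separating planning from applying is a design choice in this sketch: the same validation serves as the dry run, so there is no second code path that could skip the checks.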

Lessons and changes

Per AWS’s published postmortem:

Safety rails added to playbook commands. The capacity-removal tool was changed to remove capacity more slowly and to refuse any removal that would take a subsystem below its minimum required capacity.
Subsystem isolation in operations tooling. Tools that previously could touch multiple S3 subsystems were reworked so a single command operates on one subsystem at a time.
Faster subsystem recovery. The index and placement subsystems were re-architected to support faster cold-start, partly via partitioning so a restart involves smaller cells.
Service Health Dashboard made multi-region. The Dashboard’s administration console was changed to run across multiple AWS regions, removing its shared-fate dependency on S3 in a single region.
Operational drills for the long-tail recovery paths. Cold-starts that hadn’t been exercised in years were brought back into the regular drill cadence.
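One way to keep a drill cadence honest is to record each drill’s cold-start time and flag regressions automatically. The sketch below is an illustration only: `Drill`, `review_drills`, the 60-minute budget, the 1.5x growth threshold, and the sample measurements are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Drill:
    when: date
    cold_start_minutes: float


def review_drills(drills, budget_minutes, max_growth=1.5):
    """Return warnings based on the most recent cold-start drill."""
    if not drills:
        return ["no cold-start drill on record"]

    ordered = sorted(drills, key=lambda d: d.when)
    latest = ordered[-1]
    warnings = []

    # Absolute check: is the latest cold start within the recovery budget?
    if latest.cold_start_minutes > budget_minutes:
        warnings.append(
            f"cold start took {latest.cold_start_minutes:.0f} min, "
            f"budget is {budget_minutes:.0f} min"
        )

    # Relative check: did cold-start time jump since the previous drill?
    if len(ordered) >= 2:
        previous = ordered[-2]
        if latest.cold_start_minutes > previous.cold_start_minutes * max_growth:
            warnings.append(
                f"cold start grew from {previous.cold_start_minutes:.0f} "
                f"to {latest.cold_start_minutes:.0f} min since {previous.when}"
            )
    return warnings


# Hypothetical drill history.
history = [
    Drill(date(2015, 3, 1), 12),
    Drill(date(2016, 3, 1), 20),
    Drill(date(2017, 3, 1), 75),
]

for warning in review_drills(history, budget_minutes=60):
    print(warning)
```

The relative check matters as much as the absolute one: it surfaces growth while the system is still inside its budget, before an outage does.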

What it teaches in general

Generalising:

Operator commands need the same safety as code deploys. Code goes through canary, soak, automated rollback. An ad-hoc CLI command run by a human often does not. Yet a CLI command can destroy capacity, withdraw routes, drop tables, delete data. The asymmetry between “deploy a binary” and “run a command” needs to close: both should require review, dry-run, and bounded blast radius.
Cold-start is a load-test surface most systems neglect. Production systems are warm almost all of the time. The rare cold start tends to happen at the worst possible moment: during an outage. If your cold-start time has grown 10x over five years and nobody noticed, you find out during recovery. Periodic restart drills are the only way to keep cold-start honest.
Status dashboards must not depend on the services they report. The S3 outage blinded customers because the Service Health Dashboard ran on S3. The same architectural mistake recurs across the industry — status pages hosted on the same infrastructure they monitor. The fix is dull but mandatory: host the status dashboard somewhere with no shared dependency.
Region-level concentration creates correlated risk. Many customer architectures in 2017 ran entirely in us-east-1. When S3 there failed, those architectures failed. Multi-region deployments are expensive and complicated; running in a single region is the rational default until an event like this makes the cost of correlated failure visible. AWS has since added more guidance and tools for multi-region, but the trade-off persists.
Internal dependencies are wider than they look. S3 was the storage layer for EBS snapshots, Lambda packages, ECR images, CloudFront origins, and dozens of other AWS services. Most teams operating those services knew they depended on S3 in principle; few knew the exact failure modes. When you operate platform infrastructure inside a company, your blast radius is determined by every dependent — not by your direct customers.
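The last two points, status pages that share fate with what they report on and dependencies that are wider than they look, can both be checked against a dependency graph. The sketch below is a simplified illustration: the graph is hand-written from the dependencies named in this article, the service names are labels rather than real identifiers, and `transitive_dependents` and `shares_fate` are hypothetical helpers.

```python
from collections import deque

# Edges point from a service to the services it depends on.
# Simplified and hand-written for illustration; not a real AWS dependency map.
DEPENDS_ON = {
    "s3": [],
    "ebs-snapshots": ["s3"],
    "lambda": ["s3"],
    "ecr": ["s3"],
    "cloudfront-s3-origin": ["s3"],
    "ec2-launch": ["ebs-snapshots"],
    "status-dashboard": ["s3"],
}


def transitive_dependents(graph, failed):
    """Return every service that depends on `failed`, directly or indirectly."""
    # Invert the graph so we can walk from a dependency to its dependents.
    reverse = {}
    for service, deps in graph.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(service)

    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in reverse.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def shares_fate(graph, status_page, monitored):
    """True if the status page goes down whenever the monitored service does."""
    return status_page in transitive_dependents(graph, monitored)


print(sorted(transitive_dependents(DEPENDS_ON, "s3")))
print(shares_fate(DEPENDS_ON, "status-dashboard", "s3"))  # True
```

A check like `shares_fate` could run in CI against a declared dependency manifest. The hard part in practice is keeping that manifest accurate, which this sketch does not address.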

Trigger asymmetry: a one-line mistyped command, a single regex with nested quantifiers, a single maintenance script that withdrew BGP routes. The proximate cause of a big outage is often small.

Blast radius asymmetry: hours of degradation, dozens of dependent services, hundreds of customer companies, millions of end users affected. The consequence is out of all proportion to the trigger.

Why does us-east-1 in particular keep showing up in postmortems?

Two reasons. First, us-east-1 is AWS’s oldest, largest, and most-used region. It serves a disproportionate fraction of total AWS traffic and hosts the global control plane for several services (IAM, Route 53, CloudFront, billing). When something breaks in us-east-1, the impact is bigger than in any other region. Second, that historical role means us-east-1 has the oldest infrastructure, the most accumulated technical debt, and the most operational shortcuts that were taken when the region was smaller. The combination of biggest blast radius and oldest stack helps explain why so many high-profile incidents originate there.

The AWS postmortem at aws.amazon.com/message/41926/ is the canonical primary source.
Cloudflare 2019 — The Regex Outage — small trigger, large blast radius via a globally-deployed change.
Facebook 2021 — The BGP Withdrawal — small trigger, large blast radius via a network-layer command.