AWS us-east-1 — repeated cascade failures — System Design

Summary

us-east-1 (Northern Virginia) is AWS’s oldest and largest region. It is also the region most often associated with multi-service AWS outages. The same architectural pattern repeats: a fault in one us-east-1 service cascades into many other us-east-1 services and into the global control planes hosted there, which disrupts customer workloads in entirely different regions.

This writeup is less about one date and more about the pattern. Major us-east-1 events include the 2017 S3 outage, the 2020 Kinesis outage, multiple Direct Connect events, the 2021 Fargate / EC2 networking event, and several smaller cascades through 2023-2024. The shared lesson: regional concentration of control planes turns a single region’s reliability into the whole platform’s reliability ceiling.

Timeline

Rather than one date, here’s a representative composite of how these events tend to unfold. Times are illustrative; specific events vary in details.

T+00:00: a subsystem in us-east-1 (DNS resolver, IAM cache, Kinesis fleet, control-plane database) hits an operational issue. Symptoms: elevated error rates for a single AWS service.
T+00:05: the failing subsystem is a dependency of many other AWS services. Their control planes (the parts that create, modify, or describe resources) begin to fail, even though their data planes (the parts that route customer traffic) are healthy.
T+00:15: customers in other regions begin to see issues. Console sign-in fails (IAM control plane in us-east-1 is unhealthy). New EC2 instances can’t launch in us-east-1 (the region’s EC2 control plane is unhealthy). DynamoDB Global Tables in other regions can’t propagate config changes.
T+00:30: AWS Status Dashboard starts showing issues; updates lag the actual failures because the dashboard itself depends on the affected services.
T+01:00 to T+04:00: engineers mitigate by reducing load on the failing subsystem, rerouting traffic, and restoring caches. Direct fixes to the trigger take longer.
T+04:00 to T+12:00: full recovery, with backlogs draining in dependent services.
T+1 week: public postmortem published. Common themes: a software bug or capacity edge case as the trigger; the depth of cascading control-plane dependencies as the explanation for the blast radius.

Root cause

There is no single root cause across these events — each has its own trigger (a DNS server bug, a thread-limit hit, a deployment regression, a configuration push, an undocumented dependency). What’s shared is the structural root cause:

Many AWS global services have their control plane physically hosted in us-east-1. This is partly historical (us-east-1 is the oldest region, so things were built there first), partly architectural (some services use Route 53, IAM, or other us-east-1-hosted dependencies for cross-region coordination), and partly economic (running a global control plane in one region is cheaper than running it active-active across several regions).

So when us-east-1 is unhealthy:

IAM operations may fail or slow worldwide (sign-in, role assumption, policy changes).
Route 53 DNS changes can lag (the control plane is in us-east-1; the data plane serving DNS queries is global, so existing DNS records still resolve, but changes don’t propagate).
AWS Marketplace, AWS Organizations, Billing, and IAM Identity Center all run with us-east-1-centric control planes.
CloudFront configuration changes can be slow or fail.
S3 has multiple feature dependencies that historically touched us-east-1.

The data plane of each affected service in other regions usually keeps working (existing resources stay reachable; existing rules stay enforced). It’s the change path that fails — but in a fast-moving incident, “I can’t change anything” feels like an outage.
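As a rough illustration of that split, the sketch below models a data-plane component that refreshes its routing table from a control plane when it can and keeps serving the last-known-good table when it cannot. The class names, the `fetch` method, and the example hostname and address are all hypothetical; this is not how any AWS service is implemented.

```python
from typing import Callable, Dict, Optional


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) control-plane client when it is unreachable."""


class StaticallyStableRouter:
    """Data-plane component that keeps serving its last-known-good config."""

    def __init__(self, fetch_config: Callable[[], Dict[str, str]]) -> None:
        self._fetch_config = fetch_config   # control-plane call, may fail
        self._routes: Dict[str, str] = {}   # last-known-good routing table

    def refresh(self) -> bool:
        """Try to pull new config; on failure keep what is already loaded."""
        try:
            new_routes = self._fetch_config()
        except ControlPlaneUnavailable:
            return False                    # change path is down, keep serving
        self._routes = dict(new_routes)
        return True

    def route(self, host: str) -> Optional[str]:
        """Data-plane lookup: never calls the control plane."""
        return self._routes.get(host)


class FakeControlPlane:
    """Stand-in for a real control-plane client (illustrative only)."""

    def __init__(self) -> None:
        self.healthy = True

    def fetch(self) -> Dict[str, str]:
        if not self.healthy:
            raise ControlPlaneUnavailable("control plane unreachable")
        return {"api.example.com": "10.0.0.1"}


control_plane = FakeControlPlane()
router = StaticallyStableRouter(control_plane.fetch)
router.refresh()                    # healthy: config is loaded
control_plane.healthy = False       # simulate the outage
refreshed = router.refresh()        # fails, but the old routes are retained
print(refreshed, router.route("api.example.com"))
```

During the simulated outage the refresh reports failure, yet lookups still resolve from the retained table. New routes cannot be added, which matches the "I can't change anything" experience described above.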

Cascading effects

When us-east-1 has a bad day, the failure spreads in predictable ways:

Customer console operations break globally. Even customers running purely in eu-west-1 can’t log in to the AWS console if IAM sign-in depends on us-east-1.
CI/CD pipelines stall. Deploy systems that need to assume IAM roles, push to ECR, or modify Lambda configurations all break.
Auto-scaling slows or fails. Some scaling actions trigger control-plane calls; if those fail, scaling stops happening even though existing capacity continues serving.
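DNS-based failover can stall. A common multi-region pattern is failover through Route 53. If that failover requires editing records through the Route 53 API, it depends on the us-east-1 control plane, and those changes can’t be applied while it is impaired. Failover records and health checks configured in advance are evaluated by the globally distributed data plane and keep working.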
Third-party SaaS that depends on AWS sees compounding failures. A SaaS in us-east-1 fails directly; a SaaS in us-west-2 that calls AWS APIs in its dependency chain fails through the control plane.
The AWS Status Page lags. Customers refresh it expecting updates and see “all services operating normally” for tens of minutes after the incident started, because the page’s data sources are themselves affected.
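One way to reason about how far such a cascade reaches is to walk the dependency graph backwards from the failed component. The sketch below does a breadth-first traversal over an invented dependency map; the service names are illustrative placeholders, not AWS's real internal dependencies.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every service that transitively depends on `failed`."""
    # Invert the edges: for each dependency, record who depends on it.
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:      # also guards against cycles
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical dependency map, for illustration only.
DEPENDS_ON = {
    "console-sign-in": ["iam-control-plane"],
    "ci-cd-deploy": ["iam-control-plane", "container-registry"],
    "autoscaling": ["compute-control-plane"],
    "compute-control-plane": ["iam-control-plane"],
    "container-registry": [],
    "iam-control-plane": ["internal-dns"],
    "internal-dns": [],
}

impacted = sorted(blast_radius(DEPENDS_ON, "internal-dns"))
print(impacted)
```

In this toy map a single low-level failure reaches five of the six other services, while `container-registry` is untouched because nothing on its path depends on the failed component.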

The blast radius is often described as “half the internet” in news coverage. That’s not quite literal — but Stripe, Slack, Coinbase, Snapchat, and many other widely-used services have us-east-1 dependencies that show through during these events.

What was fixed

Each individual event produces specific fixes (a thread limit raised, a code path made independent, a cache pre-warmed). The structural fixes that AWS has gradually rolled out over many years:

Regional independence for control planes where feasible. Some services that originally had us-east-1-only control planes have been moved to a per-region or multi-region model.
Static stability as an explicit design principle: data planes are designed to continue operating even when control planes are unavailable. New resources can’t be created during a control-plane outage, but existing resources continue serving.
Dependency hygiene internally: cross-team work to eliminate undocumented dependencies on us-east-1 services. This is a long-running effort; some dependencies remain.
Public-facing status page hardened to be hostable independently of any one region’s stack.
Customer-facing recommendations to architect multi-region from the start, including the operational reality that the control plane is the often-overlooked single-region dependency.

Customers, separately, have learned to:

Treat us-east-1 as the canary region for new resources, not as the only production region.
Architect for “us-east-1 control plane is down” as a real failure mode, not a hypothetical.
Avoid hard dependencies on us-east-1-only AWS services for production workloads, even when running in other regions.
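A minimal sketch of what designing for that failure mode can look like on the client side: every regional endpoint is provisioned ahead of time, and the caller picks one using only a data-plane health probe, so no control-plane operation is needed at failover time. The endpoint URLs and the `is_healthy` probe are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical, pre-provisioned endpoints in preference order.
ENDPOINTS = (
    "https://api.us-east-1.example.com",
    "https://api.eu-west-1.example.com",
)


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose data-plane probe passes, else None.

    No record edit, resource creation, or role change is required:
    every candidate already exists and is already able to serve.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises counts as unhealthy; try the next one.
            continue
    return None


# Simulate an outage of the first region with a stub probe.
unreachable = {"https://api.us-east-1.example.com"}
chosen = pick_endpoint(ENDPOINTS, lambda url: url not in unreachable)
print(chosen)
```

In a real client the probe would be an actual request with a short timeout. The point of the design is that the decision uses only resources that existed before the incident began.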

Lessons that generalize

Concretely:

The control plane and the data plane have different failure modes and require different design. A data plane can be made highly available via replication and caching. A control plane is often centralized for consistency reasons and is therefore harder to replicate. This asymmetry is not an accident; it is a fundamental architectural choice, but it must be acknowledged and budgeted for.
“Failover to another region” depends on the failover mechanism working in the failure mode. A Route 53 failover that needs a record change through the Route 53 control plane, at the moment that control plane is impaired, is a circular dependency; failover records and health checks configured in advance are evaluated by the data plane and avoid it. The same anti-pattern appears in Kubernetes (control plane in one zone), database failovers (consensus depends on the network the database is in), and CDN configurations. Test the failover under the actual failure conditions.
Static stability is the design pattern. When the control plane is down, the data plane should continue serving what it was already serving. Auto-scaling should preserve last-known-good capacity. Failovers should not require fresh control-plane operations to execute. The pattern is “design for the control plane to be down and the data plane to keep working” — and it requires explicit engineering, not hope.
Regional concentration is invisible until it bites. Many companies discovered during a us-east-1 outage that they had “multi-region” architectures with hidden us-east-1 dependencies — DNS configurations, IAM users, S3 buckets, CloudFront distributions. The dependency map needs explicit auditing; assuming “we run in multiple regions” doesn’t make it so.
Vendor reliability is a function of operational practice, not just architecture. AWS’s individual services are exceptionally well-engineered. The us-east-1 events are mostly about operational practices: capacity changes without staged rollouts, internal dependencies without explicit contracts, and the historical sediment of building things in one region for years.
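The dependency audit mentioned above can start very simply. The sketch below classifies configured endpoint URLs by whether their hostname names a region; anything pinned to us-east-1, or using a region-less AWS hostname, is flagged for review. The hostname heuristic, the config keys, and the bucket name are assumptions, and a flagged entry is only a prompt to check the service's documentation, not proof of a us-east-1 dependency.

```python
import re
from typing import Dict, List, Tuple
from urllib.parse import urlparse

# Matches hostname labels shaped like a region name, e.g. "eu-west-1".
REGION_LABEL = re.compile(r"[a-z]{2}(?:-[a-z]+)+-\d")


def classify(url: str) -> str:
    """Classify an endpoint URL by the region information in its hostname."""
    host = urlparse(url).hostname or ""
    if not host.endswith(".amazonaws.com"):
        return "not-aws"
    regions = [label for label in host.split(".") if REGION_LABEL.fullmatch(label)]
    if not regions:
        return "global-endpoint"    # no region label: find out where it is served
    return "us-east-1" if "us-east-1" in regions else "regional"


def audit(config: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (name, reason) pairs for endpoints that deserve a closer look."""
    flagged = []
    for name, url in config.items():
        kind = classify(url)
        if kind in ("global-endpoint", "us-east-1"):
            flagged.append((name, kind))
    return flagged


# Hypothetical configuration of a workload that "runs in eu-west-1".
CONFIG = {
    "sts": "https://sts.amazonaws.com",
    "queue": "https://sqs.eu-west-1.amazonaws.com",
    "legacy_bucket": "https://s3.us-east-1.amazonaws.com/example-bucket",
    "payments": "https://api.example.com",
}

findings = audit(CONFIG)
print(findings)
```

A scan like this only covers endpoints that appear in configuration. Dependencies introduced by SDK defaults, third-party services, or deploy tooling need a separate look.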

A subtler lesson: at scale, your availability ceiling is set by your weakest hard dependency. A 99.99%-available service that hard-depends on a 99.9%-available control plane is at best about a 99.89%-available service (0.9999 × 0.999 ≈ 0.9989, assuming independent failures). Pick your dependencies’ SLAs as carefully as you set your own.
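The arithmetic behind that claim, as a small sketch. It assumes hard (serial) dependencies with independent failures, which real systems only approximate; the availability figures are the ones from the paragraph above.

```python
from math import prod
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per (non-leap) year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


service = 0.9999          # the service on its own
control_plane = 0.999     # the dependency it cannot work without
combined = serial_availability([service, control_plane])

print(f"combined availability: {combined:.5f}")
print(f"service alone:   {downtime_minutes_per_year(service):.1f} min/year")
print(f"with dependency: {downtime_minutes_per_year(combined):.1f} min/year")
```

Under these assumptions the expected downtime grows from under an hour a year to more than nine hours, almost all of it inherited from the dependency.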

AWS’s various postmortems on individual us-east-1 events are the primary source material.
AWS Kinesis — 2020 us-east-1 outage — a specific event that illustrates the pattern.
Facebook / WhatsApp / Instagram — 2021 BGP outage — different trigger, same recovery-depends-on-the-broken-system pattern.
Availability — the math behind “weakest dependency’s SLA dominates”.
Fault Tolerance — static stability as a design pattern.