AWS Kinesis — 2020 us-east-1 outage

Thread-limit exhaustion cascaded across services that depended on Kinesis for control-plane operations.

Postmortem Intermediate
aws cascading-failure control-plane thread-pool

Summary

On 25 November 2020, AWS Kinesis Data Streams in the us-east-1 region was impaired for roughly 17 hours. The trigger was a routine capacity addition: new servers were added to the Kinesis front-end fleet, which pushed the thread count of every front-end process past the operating system’s per-process thread limit.

With the Kinesis front-end unable to accept connections, AWS services that depended on Kinesis for their own internal operations failed in turn. CloudWatch, Cognito, EventBridge, Lambda functions invoked from event sources, and a long list of others were degraded or unavailable.

The event is a canonical example of a small, internal-only operational change cascading into a region-wide failure: shared infrastructure that underpins the control plane of many services has a blast radius larger than any single service.

Timeline

All times UTC. Reconstructed from AWS’s own postmortem published days after the event.

  • 09:15 (approx.) — additional front-end capacity is added to the Kinesis Data Streams service in us-east-1, as part of normal scaling.
  • ~10:00 — engineers begin investigating elevated errors on Kinesis Data Streams. The new front-ends are unhealthy; existing front-ends start experiencing thread-creation failures.
  • ~12:30 — CloudWatch metric ingestion (which uses Kinesis internally) begins to degrade. Customers running CloudWatch dashboards see stale or missing data.
  • ~13:00 — Cognito (which uses Kinesis for user-activity log ingestion) starts returning errors. Customer authentication flows on services using Cognito break.
  • ~14:00 — Lambda invocations triggered by Kinesis or DynamoDB streams accumulate backpressure; many fail with throttling errors.
  • ~17:00-22:00 — AWS engineers reduce the number of front-end threads per server (by reducing shard fanout per process) and bring servers back online in stages. Recovery is delayed by the need to drain backlogs in dependent services.
  • 02:30 next day — full recovery announced; all dependent services back to normal.

Root cause

The Kinesis front-end fleet processes streams by sharding work across many threads — one (or a few) per shard the front-end is responsible for. When the fleet grew, each existing front-end opened new connections to the new servers as part of a peering protocol, and each peering connection consumed an OS thread.

The number of threads per front-end process scaled with the number of other front-end processes in the fleet. Adding capacity made every existing server’s thread count grow. Eventually the count hit the OS per-process thread limit; new thread creation failed; the front-end couldn’t accept connections or process incoming records.
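
The scaling relationship above can be sketched as a toy model. The numbers here (a baseline of 200 threads, one thread per peer, a limit of 10,000) are assumptions chosen for illustration, not values from AWS’s postmortem; the point is only that the per-process count grows linearly with fleet size and crosses a fixed limit at a predictable point.

```python
def threads_per_frontend(fleet_size: int, base_threads: int = 200,
                         threads_per_peer: int = 1) -> int:
    """Threads one front-end needs if it keeps a thread for every other fleet member.

    base_threads and threads_per_peer are hypothetical values for illustration.
    """
    return base_threads + threads_per_peer * (fleet_size - 1)


def max_safe_fleet_size(thread_limit: int, base_threads: int = 200,
                        threads_per_peer: int = 1) -> int:
    """Largest fleet size at which every front-end still fits under the limit."""
    return (thread_limit - base_threads) // threads_per_peer + 1


if __name__ == "__main__":
    limit = 10_000  # hypothetical per-process thread limit
    for size in (5_000, 9_000, 9_801, 9_802):
        needed = threads_per_frontend(size)
        status = "ok" if needed <= limit else "OVER LIMIT"
        print(f"fleet={size:>6}  threads/process={needed:>6}  {status}")
```

In this model every server crosses the limit at the same fleet size, which is why adding a few machines can take down the whole fleet at once rather than only the new machines.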

The first-order failure is a per-process resource (thread count) that grows without bound as the fleet grows. The second-order failure, and the reason the outage was wide instead of narrow, is that a long list of AWS services use Kinesis for their own internal operations, including CloudWatch (the very monitoring tool engineers were watching during the event), Cognito, EventBridge, and parts of IAM.

Cascading effects

This is where the postmortem is most instructive. Kinesis failing on its own would have been a single-service outage; the cascade is what made it regional.

  1. CloudWatch metrics ingestion degraded. AWS’s own monitoring system was partially blind during the worst of the event. Engineers had reduced visibility into what was failing.
  2. CloudWatch alarms didn’t fire reliably. Some customer-defined alarms didn’t trigger because metric data wasn’t arriving; others triggered on stale data.
  3. Cognito errors propagated. Authentication flows that touched Cognito returned errors. Customers using Cognito-backed sign-in saw end-user-visible failures.
  4. Lambda event-source integrations failed. Functions wired to Kinesis or DynamoDB Streams accumulated backpressure or returned errors to upstream callers.
  5. AWS console pages slowed. Many console pages query CloudWatch behind the scenes; with CloudWatch degraded, the console itself felt slow.
  6. AWS Status Dashboard updates lagged. The Status Dashboard relies on internal services that were themselves affected. Customers complained that status updates lagged the actual failures by tens of minutes.

The pattern repeats: the control plane (monitoring, dashboards, auth, observability) of many AWS services depended on a single data plane (Kinesis) that was itself in a failure mode. Recovery required bringing Kinesis back before the dependent services could recover, in a strict topological order.
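
The recovery ordering described above can be sketched with a dependency graph and a topological sort. The edges below are a simplified illustration based on the list in this section, not AWS’s actual internal dependency map, and the service names are placeholders.

```python
from graphlib import TopologicalSorter

# service -> set of services it depends on (illustrative edges only)
DEPENDS_ON = {
    "kinesis": set(),
    "cloudwatch": {"kinesis"},
    "cognito": {"kinesis"},
    "lambda-event-sources": {"kinesis"},
    "cloudwatch-alarms": {"cloudwatch"},
    "console": {"cloudwatch"},
}


def recovery_order(depends_on):
    """Return services ordered so each one comes after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def blast_radius(depends_on, failed):
    """Return every service that depends on `failed`, directly or transitively."""
    affected = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for service, deps in depends_on.items():
            if current in deps and service not in affected:
                affected.add(service)
                frontier.append(service)
    return affected


if __name__ == "__main__":
    print("recover in this order:", recovery_order(DEPENDS_ON))
    print("affected by a kinesis failure:", sorted(blast_radius(DEPENDS_ON, "kinesis")))
```

Running the same traversal before an incident gives a rough blast-radius estimate for each shared service, which is one way to spot dependencies that are wider than their owners expect.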

What was fixed

Per AWS’s published postmortem and follow-on announcements:

  • Move Kinesis front-end to a microsharded model where threads scale with shard count, not fleet size. The peering connection cost no longer grows with fleet capacity.
  • Increase OS thread limits and reduce per-process thread fanout as immediate mitigations during the recovery window.
  • Decouple CloudWatch’s own internal use of Kinesis from customer-facing CloudWatch. A failure in Kinesis should not blind the monitoring used by AWS’s own on-call engineers.
  • Improve the internal dependency map of AWS services, so that adding capacity to a service assumed to be a “leaf” (Kinesis) cannot have unexpected ripple effects.
  • Move Status Dashboard updates to a path with fewer dependencies, so customers get accurate status sooner during multi-service events.
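
The general principle behind the first fix, keeping the thread count independent of fleet size, can be illustrated with a fixed-size worker pool. This is a generic sketch and not AWS’s implementation: `ping_peer`, `MAX_PEER_WORKERS`, and the peer names are hypothetical, and the “RPC” is a stand-in that only records which worker thread handled it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_PEER_WORKERS = 8  # hypothetical fixed budget, independent of fleet size


def ping_peer(peer: str) -> tuple[str, str]:
    """Stand-in for a real peer RPC; returns the peer and the worker thread name."""
    return peer, threading.current_thread().name


def ping_all(peers: list[str]) -> dict[str, str]:
    """Contact every peer using at most MAX_PEER_WORKERS threads."""
    with ThreadPoolExecutor(max_workers=MAX_PEER_WORKERS,
                            thread_name_prefix="peer") as pool:
        return dict(pool.map(ping_peer, peers))


if __name__ == "__main__":
    for fleet_size in (10, 1_000, 10_000):
        results = ping_all([f"frontend-{i}" for i in range(fleet_size)])
        workers = set(results.values())
        print(f"fleet={fleet_size:>6}  peers contacted={len(results):>6}  "
              f"threads used={len(workers)}")
```

The trade-off is latency: with a fixed pool, contacting a larger fleet takes longer instead of consuming more threads, which is usually the safer way to degrade.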

Lessons that generalize

Concretely:

  • Capacity changes are not inherently safe. Adding servers feels purely additive, but it changes the cluster topology: peering connections, gossip protocols, and internal data structures all change. Treat capacity additions with the same care as code deploys, including canary rollouts.
  • Resource consumption can scale with topology, not load. Kinesis hit its thread limit because of fleet size, not request volume. Capacity planning that only forecasts request volume misses these failures. Stress-test the operational shape (large fleet, many shards) before production does.
  • Monitoring must not depend on what it monitors. CloudWatch lost fidelity during an outage that engineers needed CloudWatch to diagnose, which is the worst possible time for a monitoring system to degrade. The “watch the watcher” problem is universal: every monitoring stack needs an out-of-band fallback (a different data path, a different storage backend, a different team to alert).
  • Cascading control-plane failures are the norm, not the exception, at scale. Whenever a popular shared service goes down, everything that depends on it for control-plane operations is also down. Designing services so their control plane is independent of any single shared infrastructure is the only way to bound blast radius.
  • Customer transparency during multi-service outages is operationally expensive. Status dashboards that themselves depend on the failing systems are common; designing the Status Dashboard’s hosting to be truly independent (different region, different stack, different infrastructure) is worth the cost.
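
One common shape for the out-of-band fallback mentioned in the monitoring lesson is a dead man’s switch: the monitored pipeline emits heartbeats, and an independent checker alerts when they stop. The sketch below is a minimal illustration; the `Heartbeat` type, the source names, and the 120-second threshold are assumptions, and how heartbeats are delivered and where alerts go are left out.

```python
import time
from dataclasses import dataclass


@dataclass
class Heartbeat:
    source: str       # name of the pipeline being watched (hypothetical)
    last_seen: float  # Unix timestamp of the most recent heartbeat


def stale_sources(heartbeats, now, max_age_s=120.0):
    """Return the sources whose last heartbeat is older than max_age_s seconds."""
    return sorted(hb.source for hb in heartbeats if now - hb.last_seen > max_age_s)


if __name__ == "__main__":
    now = time.time()
    observed = [
        Heartbeat("metrics-ingest", last_seen=now - 15),
        Heartbeat("alarm-evaluator", last_seen=now - 600),
    ]
    for source in stale_sources(observed, now):
        # In practice this would page through a channel that shares nothing
        # with the primary monitoring stack (assumption for illustration).
        print(f"ALERT: no heartbeat from {source} in over 2 minutes")
```

The checker is only useful if it runs on infrastructure that does not share the monitored system’s dependencies; otherwise it fails silently at the same moment the primary monitoring does.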

A subtler lesson: the dependency graph that matters is the runtime graph, not the deployment graph. Two services that ship independently can still be tightly coupled through a third internal service neither owns. The reliability calculations that assume “each service is 99.99% available” don’t hold when their availabilities are not independent.
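
The arithmetic behind that last point can be made concrete. The sketch below compares the probability that two services are down at the same moment under the naive independence assumption with the probability when both also require a shared dependency. The availability figures (99.99% per service, 99.9% for the shared dependency) are assumed for illustration only.

```python
def p_both_down_naive(own: float) -> float:
    """P(both services down) if their failures were fully independent."""
    return (1 - own) ** 2


def p_both_down_shared(own: float, shared: float) -> float:
    """P(both services down) when each also needs one shared dependency.

    Both are down if the shared dependency is down, or if it is up and
    both services happen to fail on their own.
    """
    return (1 - shared) + shared * (1 - own) ** 2


if __name__ == "__main__":
    own, shared = 0.9999, 0.999  # assumed availabilities
    naive = p_both_down_naive(own)
    coupled = p_both_down_shared(own, shared)
    print(f"naive   P(both down) = {naive:.3e}")
    print(f"coupled P(both down) = {coupled:.3e}")
    print(f"ratio                = {coupled / naive:,.0f}x")
```

With these assumed numbers, the chance of a simultaneous failure moves from roughly 1 in 100 million to roughly 1 in 1,000, and it is dominated almost entirely by the shared dependency rather than by either service’s own reliability.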
