Facebook / WhatsApp / Instagram — 2021 BGP outage — System Design

Summary#

On 4 October 2021, an audit command intended to assess the capacity of Facebook’s backbone network instead caused the BGP routes for Facebook’s DNS prefixes to be withdrawn. Within about a minute, the rest of the internet could no longer find any Facebook domain: facebook.com, whatsapp.com, instagram.com, oculus.com, the corporate domains used by employees, and the hostnames behind the management interfaces of Facebook’s own datacenters.

The outage lasted roughly six hours and is one of the canonical “small command, global blast radius” failures in distributed-systems history.

Timeline#

All times UTC.

15:39 — an engineer runs a routine backbone capacity audit. A bug in the audit tool’s validation check lets through a command that withdraws the BGP advertisements for Facebook’s authoritative DNS servers.
15:40 — within one minute, recursive DNS resolvers worldwide can no longer reach Facebook’s DNS. Every Facebook product becomes unreachable.
15:45 — internal communication tools (built on the same infrastructure) also go dark. Slack-style internal chat, internal email, and conference rooms stop working.
16:00 — engineers attempt remote remediation from home; the corporate VPN depends on the same DNS path and cannot resolve internal hostnames.
16:30 — engineers travel to physical datacenters; the badge-reader systems on the doors authenticate against the same now-unreachable directory service. They cannot enter.
18:00 — engineers physically force entry to one datacenter and connect a console cable directly to a routing device.
21:05 — BGP routes are re-advertised. Recursive DNS caches start being populated again.
21:45 — services begin returning. Full recovery takes several more hours as caches re-warm.

Root cause#

A bug in an audit-time validation check failed to flag a configuration command that would withdraw the BGP advertisements for the prefixes holding Facebook’s DNS servers. The command executed successfully, propagated through BGP within seconds, and Facebook’s authoritative nameservers became unreachable from the public internet.

The first-order failure is the bug in the validation logic. The second-order failure — and the one that made the outage long instead of short — is that every recovery mechanism depended on the network being reachable.

Cascading effects#

This is where the postmortem gets interesting. On its own, the BGP withdrawal would likely have been a short outage, on the order of 15 minutes; many companies have had similar bugs and recovered quickly. The cascade that turned 15 minutes into 6 hours:

External DNS gone. No public-facing Facebook domain resolves.
Internal DNS depends on external DNS. Internal tooling resolves *.internal.fb.com through resolvers that fall back to public records during a refresh. With the public records unreachable, internal DNS started serving stale data and then failing.
Auth depends on DNS. Engineers’ tools couldn’t authenticate to internal systems because the auth service’s hostname didn’t resolve.
Corporate office tools depend on auth. Slack-equivalent, email, scheduling, internal docs — all unreachable.
Badge readers depend on directory service. Engineers couldn’t enter datacenters or, in some buildings, even leave secured floors.
Out-of-band management network was the same network. The “break glass” path that exists specifically for this scenario shared infrastructure with the failed production path.
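
The cascade above is a reachability question on a dependency graph: a recovery path is unusable if it transitively depends on the component that failed. The following is a minimal sketch of that check. The component names and edges are hypothetical, loosely paraphrasing the list above; in particular, the edges from `directory_service` to `internal_dns` and from `production_network` to `bgp_routes` are assumptions made for illustration, not details confirmed by Facebook.

```python
# Hypothetical dependency edges: each key depends on every item in its list.
DEPENDS_ON = {
    "external_dns": ["bgp_routes"],
    "internal_dns": ["external_dns"],
    "auth": ["internal_dns"],
    "office_tools": ["auth"],
    "directory_service": ["internal_dns"],      # assumed edge
    "badge_readers": ["directory_service"],
    "production_network": ["bgp_routes"],       # assumed edge
    "oob_management": ["production_network"],
    "console_cable": [],                        # physical access, no network dependency
}


def transitive_deps(node, graph):
    """Return every component reachable from node by following dependency edges."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def unusable_when(failed, recovery_paths, graph):
    """List recovery paths that are, or transitively depend on, the failed component."""
    return sorted(
        path for path in recovery_paths
        if path == failed or failed in transitive_deps(path, graph)
    )


RECOVERY_PATHS = ["office_tools", "badge_readers", "oob_management", "console_cable"]
print(unusable_when("bgp_routes", RECOVERY_PATHS, DEPENDS_ON))
# ['badge_readers', 'office_tools', 'oob_management']
```

In this toy graph only the console cable survives the failure, which matches the recovery that was eventually used. Running a check like this against a real service inventory, for each plausible failed component, is one way to find break-glass paths that are not actually independent.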

The recovery sequence required physically driving to a datacenter, getting building security to mechanically unlock doors, and connecting a console cable directly to a routing device, a procedure that had not been drilled at the scale required.

What was fixed#

Public information (from Facebook’s own engineering blog and from several engineers who wrote about the incident in the months afterward):

The BGP audit tool gained stricter pre-flight validation: dry-run checks against a model of the network, plus an explicit “this command would withdraw routes serving DNS prefixes” warning gate.
Prefixes that host the authoritative DNS servers are now tagged at the BGP level so the validation tool can refuse to withdraw them without an explicit override.
The out-of-band management network was rebuilt to use entirely separate physical infrastructure, separate name resolution, and separate authentication that does not depend on the production directory service.
Datacenter physical access procedures were updated so that emergency entry doesn’t require online directory authentication; the on-site security team has manual override authority.
Internal communication tooling for incident response was moved to a service that does not depend on Facebook DNS, including a manual fallback to third-party group chat tools so engineers can coordinate while internal chat is down.
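
The first two fixes amount to a gate in front of route withdrawals: compare the prefixes a change would withdraw against a tagged, protected set, and refuse unless an operator overrides explicitly. The sketch below shows the shape of such a gate using Python's standard `ipaddress` module. It is an illustration under stated assumptions, not Facebook's tooling: `PROTECTED_PREFIXES`, `ProtectedPrefixError`, and `preflight_withdrawals` are hypothetical names, and the prefixes are reserved documentation ranges rather than real Facebook address space.

```python
import ipaddress

# Hypothetical protected set: prefixes tagged as hosting authoritative DNS.
# These are documentation ranges (RFC 5737 / RFC 3849), not real prefixes.
PROTECTED_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8:53::/48"),
]


class ProtectedPrefixError(Exception):
    """Raised when a change would withdraw a protected prefix without an override."""


def preflight_withdrawals(withdrawals, override=False):
    """Return the withdrawals that touch protected prefixes; raise unless overridden."""
    flagged = []
    for text in withdrawals:
        net = ipaddress.ip_network(text)
        for protected in PROTECTED_PREFIXES:
            # Only compare networks of the same IP version.
            if net.version == protected.version and net.overlaps(protected):
                flagged.append(text)
                break
    if flagged and not override:
        raise ProtectedPrefixError(f"would withdraw protected prefixes: {flagged}")
    return flagged


# A withdrawal that does not touch the protected set passes silently.
print(preflight_withdrawals(["198.51.100.0/24"]))

# A withdrawal inside a protected prefix is refused without an override.
try:
    preflight_withdrawals(["192.0.2.128/25"])
except ProtectedPrefixError as err:
    print("blocked:", err)
```

The check uses overlap rather than exact match so that withdrawing a more specific or a covering prefix is also caught. The override is deliberately an explicit argument: the point of the gate is to force a conscious decision, not to make the withdrawal impossible.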

Lessons that generalize#

Concretely:

Out-of-band must mean out-of-band, not “the same network on a different subnet”. Different physical fiber, different DNS, different auth.
Physical access procedures must survive a directory-service outage. This is not paranoia — it’s the actual failure mode that happened.
Pre-flight checks on commands with a global blast radius are mandatory. A 5-minute pause to confirm “yes, this will withdraw BGP routes for DNS prefixes” could have prevented the entire event.
Audit and production are not separate risk profiles. A command run by an audit tool can be exactly as catastrophic as a deliberate change; if anything more so, because audits tend to be treated as low-risk and get less scrutiny.

A subtler lesson: at scale, blast radius is determined by the dependency graph of recovery, not by the system being changed. A 99.999% available service with a 50%-available recovery path is a 50%-available service the moment it breaks. Always test the recovery path under realistic failure scenarios — not just the steady state.
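
The arithmetic behind this can be sketched with the standard steady-state formula, availability = MTBF / (MTBF + MTTR), plus a simple model in which the recovery path either works (fast restore) or does not (slow, manual restore). The 15-minute and 360-minute figures below reuse the "15 minutes into 6 hours" contrast from this incident; the function names and the two-outcome model are illustrative assumptions, not a formal reliability analysis.

```python
def expected_restore_minutes(p_recovery_works, fast_minutes, slow_minutes):
    """Expected time to restore when the recovery path works with probability p."""
    return p_recovery_works * fast_minutes + (1 - p_recovery_works) * slow_minutes


def availability(mtbf_minutes, mttr_minutes):
    """Steady-state availability from mean time between failures and mean time to restore."""
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


FAST = 15    # minutes: remote rollback through a working recovery path
SLOW = 360   # minutes: physical access and console-cable recovery

for p in (1.0, 0.9, 0.5):
    mttr = expected_restore_minutes(p, FAST, SLOW)
    print(f"recovery path works {p:.0%} of the time -> expected restore {mttr:.1f} min")
```

With a recovery path that works only half the time, the expected restore time in this model is 187.5 minutes rather than 15, so the mean time to restore, and with it the availability figure, is dominated by the slow branch.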

Related material#

Facebook’s own postmortem: the engineering blog post published the following day is the canonical primary source.
AWS Kinesis — 2020 us-east-1 outage — different trigger (thread-limit exhaustion), same shape (control-plane dependency on data-plane).
Failure Models — the spectrum of failure assumptions that need stress-testing.
Fault Tolerance — graceful-degradation patterns and where they would have helped.
