Facebook 2021 — The BGP Withdrawal — Computer Networks

What happened

On 4 October 2021, at approximately 15:39 UTC, Facebook, Instagram, WhatsApp, Messenger, Oculus, and the entire Facebook (Meta) corporate network disappeared from the Internet. For roughly six hours, facebook.com would not resolve in DNS; even when its IP addresses were known, packets to those addresses had no route. Meta’s authoritative DNS servers were unreachable, the BGP routes that advertised them to the Internet had been withdrawn, and the company’s own internal tools — including the ones engineers would normally use to fix the problem — relied on the same network that had gone dark.

Meta engineers had to physically gain access to a datacentre to revert the configuration. The recovery began around 21:00 UTC and full service restoration was announced later that evening. The public summary on engineering.fb.com (4 October 2021) and the more detailed post published the following day (5 October 2021) are the canonical sources.

Context

Facebook’s network was unusual in its degree of vertical integration. The company operated its own backbone (a private global network connecting its datacentres), its own edge POPs, its own authoritative DNS servers, and announced its own IP prefixes to the Internet via BGP from those POPs and datacentres. The architectural decision was the right one for performance, cost, and control at Facebook’s scale; it also meant that Meta’s Internet presence depended on a small number of operations the company performed on its own equipment.

Authoritative DNS for facebook.com, instagram.com, whatsapp.com, and other Meta domains was served from name servers a.ns.facebook.com through d.ns.facebook.com. The IP prefixes containing those name servers were advertised to the Internet by Meta’s BGP routers at the edge of each datacentre. If the name servers stopped being reachable — for any reason — the Internet had no way to look up any of Meta’s services.

This single point of failure was implicit. The DNS service itself was healthy and globally distributed; the routability of its IP addresses was the weak link.
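The shared fate between a name server and the prefix that contains it can be illustrated with a toy routing table. This is a minimal sketch, not Meta's configuration: the addresses come from the RFC 5737 documentation range, the AS number from the RFC 5398 documentation range, and the `RoutingTable` class is a hypothetical stand-in for one external router's view.

```python
import ipaddress


class RoutingTable:
    """Toy model of one external router's view of the Internet."""

    def __init__(self):
        self.routes = {}  # ip_network -> label of the AS the route leads to

    def announce(self, prefix, origin):
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Longest-prefix match; returns None when no route covers the address."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


# Hypothetical values: documentation address space, not Meta's real prefixes.
NAMESERVER_IP = "192.0.2.53"
NAMESERVER_PREFIX = "192.0.2.0/24"

table = RoutingTable()
table.announce(NAMESERVER_PREFIX, "AS64500")
before = table.lookup(NAMESERVER_IP)   # a route exists, so the server is reachable

table.withdraw(NAMESERVER_PREFIX)
after = table.lookup(NAMESERVER_IP)    # no covering route: the server is unreachable

print(before, after)
```

The name server process is never touched in this model. Only the route changes, which is enough to make the server unreachable.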

Trigger and propagation

A routine maintenance command was issued to assess available capacity on the global backbone. The command had a bug: according to Meta's more detailed post, it unintentionally took down all the connections in the backbone, disconnecting Meta's datacentres from one another. Meta's DNS servers are designed to stop advertising themselves via BGP when they cannot reach the datacentres, so the routes that announced the prefixes containing its authoritative DNS servers were withdrawn from the Internet.

There was an automated audit tool designed to catch exactly this class of command and refuse to execute it. The tool had a bug too, and didn’t catch it.
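Meta has not published how its audit tool works, so the following is only a sketch of one plausible kind of pre-execution check: refuse any change whose blast radius exceeds a fixed fraction of the fleet, and fail closed when the inputs look wrong. `ChangeRequest`, `audit`, and the 10% limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    description: str
    target_routers: frozenset


def audit(change, all_routers, max_fraction=0.1):
    """Return (allowed, reason). Reject changes that touch too much of the fleet."""
    if not all_routers:
        # Fail closed: without an inventory the blast radius cannot be judged.
        return False, "empty inventory: refusing to guess blast radius"
    unknown = change.target_routers - all_routers
    if unknown:
        return False, f"unknown targets: {sorted(unknown)}"
    fraction = len(change.target_routers) / len(all_routers)
    if fraction > max_fraction:
        return False, f"touches {fraction:.0%} of routers, limit is {max_fraction:.0%}"
    return True, "within blast-radius limit"


fleet = frozenset(f"r{i}" for i in range(20))
global_change = ChangeRequest("assess backbone capacity", fleet)
local_change = ChangeRequest("drain one router", frozenset({"r3"}))

print(audit(global_change, fleet))
print(audit(local_change, fleet))
```

A check like this is only as good as its own correctness, which is the point of the incident: the guard is itself code and needs the same testing as the commands it guards.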

Within seconds, the BGP withdrawal began propagating through the Internet’s routing tables. Every BGP-speaking router that had previously known how to reach Meta’s prefixes removed them from its tables. From the outside, Meta simply ceased to exist on the Internet: not slow, not degraded, gone. Recursive resolvers looking up facebook.com could not reach the authoritative servers and returned SERVFAIL to their clients; clients and resolvers retried, the queries cascaded, and global DNS infrastructure saw a step-function increase in load.
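The step-function increase in load follows from simple arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: the cache hit rate, retry counts, and attempts per name server are invented inputs, and only the count of four name servers (a through d) comes from the text above.

```python
def upstream_queries_healthy(client_lookups, cache_hit_percent):
    """Queries that reach the authoritative servers when answers are cacheable."""
    return client_lookups * (100 - cache_hit_percent) // 100


def upstream_queries_failed(client_lookups, client_retries,
                            nameservers_tried, attempts_per_nameserver):
    """Queries sent upstream when every attempt times out and nothing is cached."""
    lookups_including_retries = client_lookups * (1 + client_retries)
    return lookups_including_retries * nameservers_tried * attempts_per_nameserver


# Hypothetical inputs: 1000 client lookups, 90% cache hits when healthy,
# 2 client retries, 4 name servers, 2 attempts per name server.
healthy = upstream_queries_healthy(client_lookups=1000, cache_hit_percent=90)
failed = upstream_queries_failed(client_lookups=1000, client_retries=2,
                                 nameservers_tried=4, attempts_per_nameserver=2)

print(healthy, failed, failed / healthy)
```

The model ignores failure caching and backoff in real resolvers, so it overstates the multiplier. The structure is what matters: losing the cache and adding retries at two layers multiply together rather than add.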

15:39  Maintenance command issued
15:40  BGP routes for Meta's prefixes withdrawn globally
15:40-15:45  DNS resolvers worldwide fail to resolve Meta domains
            Recursive resolvers retry, multiplying query load
            DNS root and TLD servers see anomalous traffic
~16:00  Meta engineers identify the issue
        Cannot remotely access affected routers (the network is down)
        Internal tools rely on the same network — also unreachable
~17:00-19:00  Engineers physically travel to datacentres
              Badge readers, internal communications, video conferencing
              all rely on the network → also degraded
~19:00-21:00  Physical access gained; routers manually restored
~21:00  BGP announcements resume; routes propagate
21:00-22:30  DNS resolves, service comes back, backlog drains

Detection and response

External detection was immediate — within seconds of the BGP withdrawal, every monitoring system on the Internet that watched Meta’s services lit up. Internal detection at Meta was also immediate, but the response was crippled by the same outage that triggered the alarm.

Three layers of dependency made recovery slow:

The remote-management network shared infrastructure with the production network. Engineers who normally SSH’d to routers from anywhere in the world could not reach them; the management plane ran over the same BGP-advertised network that was now down.
Internal collaboration tools were on the same network. Workplace, internal Slack-equivalent, video conferencing, calendar, paging — all dependent on Meta’s own network. Engineers could not effectively coordinate via the company’s own infrastructure during the outage.
Physical-access systems were on the same network. Some badge readers and door access systems also relied on Meta’s network. Reports indicated engineers had difficulty entering parts of the datacentre to get hands-on access to the affected hardware.

The eventual recovery required physical presence at the affected datacentres to manually issue commands restoring the BGP advertisements. Once routes were re-announced, BGP propagation across the Internet took only minutes; the longest tail of the outage was the cold start on DNS caches and the backlog of clients retrying.

Root cause

The proximate cause was a maintenance command that withdrew BGP advertisements. The audit tool that should have caught the command had a bug.

The deeper causes are architectural:

Single point of failure on the authoritative DNS prefix. Meta’s authoritative DNS lived behind a small number of BGP-advertised prefixes from Meta’s own infrastructure. When those prefixes were withdrawn, DNS itself became unreachable. Distributing authoritative DNS across multiple independent ASes (or using a third-party DNS provider as a secondary) would have kept name resolution working.
Out-of-band access was not actually out-of-band. The “management plane” relied on the same network as production. A true OOB network would have used different physical paths, different ASes, different ISPs — so that a failure of the production network would not impair the recovery channel.
Internal tools dependent on the failing network. Engineers could not effectively communicate using the company’s own collaboration tools during the outage. The recovery workflow assumed a healthy network.
Physical-access dependency. Some site-access systems were on the affected network, slowing physical recovery.
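Hidden dependencies like these can be checked mechanically if the dependency graph is written down. The sketch below walks a graph and reports every recovery tool that transitively depends on the thing it is supposed to recover. The graph contents and node names are hypothetical and only loosely mirror the four causes above.

```python
def transitive_dependencies(graph, start):
    """Return everything `start` depends on, directly or indirectly."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def shared_fate(graph, recovery_tools, protected):
    """Recovery tools that would fail together with the `protected` system."""
    return sorted(tool for tool in recovery_tools
                  if protected in transitive_dependencies(graph, tool))


# Hypothetical dependency graph: each key depends on the items in its list.
deps = {
    "router_ssh": ["management_network"],
    "management_network": ["production_backbone"],
    "chat": ["production_backbone"],
    "badge_readers": ["corporate_network"],
    "corporate_network": ["production_backbone"],
    "serial_console": ["cellular_modem"],
}
tools = ["router_ssh", "chat", "badge_readers", "serial_console"]

at_risk = shared_fate(deps, tools, "production_backbone")
print(at_risk)
```

In this made-up graph only the serial console over a cellular modem survives a backbone failure. The check is only as accurate as the graph, which tends to drift as systems are consolidated.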

Lessons and changes

Per Meta’s public statements:

Audit-tool bug fixed. The tool that should have rejected the bad command was patched.
Stricter change-management around commands that affect global BGP advertisements. Treat them as the highest-risk class of change, with reviews, soak times, and dry-runs.
Improved out-of-band access for emergency recovery. Recovery paths that don’t traverse the production network and don’t depend on production services.
Re-evaluation of internal-tool dependencies. Tools that engineers need during outages must work when production is down.

Not all of the internal changes are public; Meta released a high-level summary rather than a deep technical postmortem in the style Cloudflare or AWS typically publish.

What it teaches in general

Generalising:

Authoritative DNS is a single point of failure if it shares fate with the rest of your infrastructure. Many companies use a third-party DNS provider as a secondary authoritative explicitly to survive their own-network outages. Trade-off: external dependencies have their own failure modes, but the failures are uncorrelated.
BGP is fragile by design. It trusts the announcements it hears. Routes withdrawn intentionally are indistinguishable from routes withdrawn by mistake. There’s no transactional safety net at BGP scale — once a withdrawal propagates, every router in the world has acted on it.
The recovery channel must not depend on what it recovers. A management plane, an internal-tool stack, and a physical-access stack must all have at least one path that does not run over the same network as production. This is hard to maintain over years of engineering pressure to consolidate.
A maintenance command can have a worse blast radius than a code deploy. Code deploys go through canary, soak, automated rollback. Ad-hoc commands run by humans against live systems usually don’t. Yet they can take down everything. Tightening the gate on ops commands is at least as important as tightening the gate on code.
DNS cascades make outages visible faster than the cause. Recursive resolvers retry aggressively; a sudden DNS failure produces a spike in global DNS query load that operators and ISPs see immediately. The visible symptom often outruns the technical cause.
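One way to give ops commands the canary, soak, and rollback gates that code deploys already have is to apply them in growing batches with a health check between batches. This is a generic sketch rather than any vendor's or Meta's tooling: `staged_apply`, the stage sizes, and the 90% capacity threshold are assumptions.

```python
def staged_apply(routers, apply, rollback, healthy, stages=(1, 5, 25)):
    """Apply a change in growing batches; undo everything on the first failed check."""
    done = []
    remaining = list(routers)
    sizes = list(stages)
    while remaining:
        # After the listed stages are used up, take whatever is left in one batch.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        for router in batch:
            apply(router)
            done.append(router)
        if not healthy():
            for router in reversed(done):
                rollback(router)
            return False, done
    return True, done


# Hypothetical fleet where the "change" takes each router out of service.
fleet = [f"r{i}" for i in range(40)]
up = set(fleet)


def take_down(router):
    up.discard(router)


def bring_up(router):
    up.add(router)


def enough_capacity():
    return len(up) >= 0.9 * len(fleet)


ok, touched = staged_apply(fleet, take_down, bring_up, enough_capacity)
print(ok, len(touched), len(up))
```

The rollback step assumes the routers are still reachable after the change, so it only helps if the channel used to roll back does not depend on the network being changed.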

Cloudflare 2019 — a bad WAF rule deployed globally and simultaneously took down every edge node. Trigger was a regex; mitigation was a kill-switch. Recovery in 27 minutes because the kill-switch already existed and worked.

Meta 2021 — a maintenance command withdrew BGP routes for the entire company’s nameservers. Trigger was a CLI mistake plus an audit-tool bug; mitigation required physical access to a datacentre because the management plane was on the same network. Recovery in 6 hours because OOB had silently become in-band.

Could a third-party secondary DNS have helped?

Partially. If facebook.com had been served by both Meta’s own nameservers and (say) AWS Route 53 as a secondary, DNS lookups would have continued to resolve. But the IP addresses returned would still point to Meta’s prefixes, which were no longer routable. Clients would have got A records but couldn’t reach the servers. Useful for partial outages where DNS is degraded but the underlying network is fine; not a fix for “all your prefixes are withdrawn”. The complete fix requires DNS resolution and prefix routability to be independent — and at Meta’s scale, that means announcing the same prefix from multiple uncorrelated ASes, which is operationally complex.
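The reasoning above separates two independent conditions: whether some authoritative name server is reachable, and whether some network still announces the service prefix. A small model makes the three cases explicit. The labels `own`, `third_party`, and `independent` are hypothetical network names, and `client_outcome` is an illustration rather than a description of any real resolver.

```python
def client_outcome(nameserver_networks, service_origin_networks, failed_networks):
    """Classify what a client sees when the networks in `failed_networks` go dark."""
    failed = set(failed_networks)
    live_nameservers = set(nameserver_networks) - failed
    live_origins = set(service_origin_networks) - failed
    if not live_nameservers:
        return "dns_failure"
    if not live_origins:
        return "resolves_but_unroutable"
    return "reachable"


outage = {"own"}  # the company's own network withdraws every prefix

# DNS and service both only on the company's own network.
case_own_only = client_outcome({"own"}, {"own"}, outage)

# A third-party secondary DNS is added; the service prefix is still own-network only.
case_secondary_dns = client_outcome({"own", "third_party"}, {"own"}, outage)

# DNS and the service prefix are both also served from an independent network.
case_independent = client_outcome({"own", "third_party"},
                                  {"own", "independent"}, outage)

print(case_own_only, case_secondary_dns, case_independent)
```

Only the third case keeps the service usable, which matches the conclusion that secondary DNS alone turns a resolution failure into a routing failure.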

The Meta engineering blog post (engineering.fb.com/2021/10/05/networking-traffic/outage/) is the canonical primary source.
Cloudflare 2019 — The Regex Outage — different layer, similar shape: a single global change took out a global service.
AWS us-east-1 2017 — The S3 Outage — also small-trigger / large-blast-radius, but at the application layer.