Interdomain Routing — BGP
Path-vector routing between ASes, policy-based selection, the trust model that makes BGP both essential and fragile.
What it is
The Border Gateway Protocol (BGP) is how every autonomous system (AS) on the Internet tells every other AS how to reach its address space. OSPF and RIP solve routing inside one administrative domain, where every router belongs to the same operator and shares the same trust model. BGP solves routing between domains, between operators who may compete, mistrust each other, and have non-technical reasons (business contracts, political relationships, regulatory constraints) to prefer one path over another.
BGP is a path-vector protocol — each route advertisement includes the full sequence of ASes the route traverses, not just a hop count. Routers select among paths using a policy layer that lets each AS encode arbitrary preferences. The protocol is small, well-defined, and has run the Internet’s interdomain routing for thirty years. Its weaknesses — the absence of a built-in trust model, the simplicity of route hijacking, the dependence on individual operators not making mistakes — show up in the headlines several times a year.
Every senior backend engineer should understand BGP at least well enough to read a public BGP-related postmortem (Facebook 2021, AWS 2019, Pakistan-YouTube 2008) and explain what went wrong.
When to use it
BGP is not a thing you “use” — it’s a thing your AS runs if you connect directly to the Internet as an independent network. Concretely:
- Every Internet Service Provider runs BGP at every interconnection point.
- Cloud providers and large content networks run BGP at the edges of their networks; CDNs use BGP-Anycast to direct users to the closest edge.
- Large enterprises that buy transit from multiple ISPs (“multi-homed”) run BGP to choose which provider’s path to use for which destination.
- Smaller enterprises typically don’t run BGP — they take a default route from a single upstream ISP and let the ISP handle the rest.
If you’re operating a service behind a single hosting provider, BGP runs above you — your provider participates; you don’t. If you’re building network infrastructure at any meaningful scale, BGP is part of your job.
How it works
Sessions and peers
Two BGP routers establish a TCP session (port 179) and exchange route advertisements. The relationship is one of two kinds:
- External BGP (eBGP) — peers in different ASes. The session crosses an AS boundary.
- Internal BGP (iBGP) — peers in the same AS. Used to propagate routes learned from eBGP across the AS’s internal routers, since routes learned over eBGP aren’t automatically re-advertised over OSPF.
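The eBGP/iBGP distinction comes down to comparing the two AS numbers on a session. A minimal sketch, assuming a hypothetical `Peering` record and AS numbers taken from the range reserved for documentation (64496 to 64511):

```python
from dataclasses import dataclass

BGP_PORT = 179  # TCP port that BGP sessions run over


@dataclass(frozen=True)
class Peering:
    """Hypothetical description of one BGP session (illustrative only)."""
    local_asn: int
    peer_asn: int


def session_kind(peering: Peering) -> str:
    """Return 'iBGP' when both ends share an AS number, otherwise 'eBGP'."""
    return "iBGP" if peering.local_asn == peering.peer_asn else "eBGP"


print(session_kind(Peering(local_asn=64500, peer_asn=64501)))  # eBGP
print(session_kind(Peering(local_asn=64500, peer_asn=64500)))  # iBGP
```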
Path advertisements
Each BGP advertisement carries a destination prefix (e.g., 185.31.16.0/22) and a set of path attributes. The most important:
    Prefix:      185.31.16.0/22           ← destination network
    AS_PATH:     [13335, 174, 7018]       ← sequence of ASes traversed
    NEXT_HOP:    192.0.2.1                ← IP to forward packets to
    LOCAL_PREF:  200                      ← internal preference (higher wins)
    MED:         50                       ← multi-exit discriminator
    COMMUNITIES: 13335:1000, 13335:7018   ← tags for policy use
    ORIGIN:      IGP                      ← how the route was originally injected

The AS_PATH is the killer attribute. It serves two purposes:
- Loop prevention. A router that sees its own AS number in the AS_PATH of an incoming advertisement rejects the advertisement. This stops routing loops without needing a global view.
- Path comparison. When multiple routes to the same prefix exist, the shorter AS_PATH usually wins (after policy filters).
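Both uses of AS_PATH fit in a few lines. The sketch below is illustrative only: the function names are hypothetical, the local AS number 64500 comes from the documentation range, and the path is the example advertisement from this section.

```python
def accept_advertisement(local_asn: int, as_path: list[int]) -> bool:
    """Loop prevention: reject any path that already contains our own AS."""
    return local_asn not in as_path


def path_for_export(local_asn: int, as_path: list[int]) -> list[int]:
    """When re-advertising to an eBGP peer, an AS prepends its own number."""
    return [local_asn, *as_path]


def shortest_as_path(candidates: list[list[int]]) -> list[int]:
    """Path comparison on AS_PATH length alone (policy is ignored here)."""
    return min(candidates, key=len)


received = [13335, 174, 7018]
print(accept_advertisement(64500, received))  # True: 64500 is not in the path
print(accept_advertisement(174, received))    # False: AS 174 sees itself
print(path_for_export(64500, received))       # [64500, 13335, 174, 7018]
```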
Path selection
When a BGP router has multiple paths to the same prefix, it runs a fixed selection algorithm — often called the best path algorithm — to pick one. The well-known order (simplified):
- Highest LOCAL_PREF (internal policy preference; the dominant lever)
- Shortest AS_PATH
- Lowest ORIGIN (IGP < EGP < incomplete)
- Lowest MED (suggested by the neighbour; only compared between paths from the same neighbour AS)
- Prefer eBGP-learned over iBGP-learned
- Lowest IGP cost to the NEXT_HOP
- Lowest router ID, lowest peer IP (tie-breakers)
LOCAL_PREF is set by the local AS based on its policy — it’s how operators encode preferences like “prefer my paid transit provider over my peer”, “prefer my customer’s routes over my transit provider’s”, “send European traffic through Frankfurt”.
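The simplified selection order can be sketched as a pairwise comparison that applies each step in turn. This is an illustration, not any vendor's implementation: the `Path` record, its field names, and the default LOCAL_PREF of 100 are assumptions made for the example, and real routers apply additional steps and vendor-specific knobs.

```python
import ipaddress
from dataclasses import dataclass
from functools import reduce

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}  # lower is preferred


@dataclass(frozen=True)
class Path:
    """Hypothetical route record; the first AS in as_path is the neighbour AS."""
    as_path: tuple
    local_pref: int = 100        # assumed default for the example
    origin: str = "IGP"
    med: int = 0
    via_ebgp: bool = True
    igp_cost: int = 0
    router_id: str = "0.0.0.0"
    peer_ip: str = "0.0.0.0"


def _ip(value: str) -> int:
    return int(ipaddress.ip_address(value))


def better(a: Path, b: Path) -> Path:
    """Return the preferred of two paths, walking the steps in order."""
    if a.local_pref != b.local_pref:
        return a if a.local_pref > b.local_pref else b
    if len(a.as_path) != len(b.as_path):
        return a if len(a.as_path) < len(b.as_path) else b
    if ORIGIN_RANK[a.origin] != ORIGIN_RANK[b.origin]:
        return a if ORIGIN_RANK[a.origin] < ORIGIN_RANK[b.origin] else b
    # MED is only comparable when both paths come from the same neighbour AS.
    if a.as_path[:1] == b.as_path[:1] and a.med != b.med:
        return a if a.med < b.med else b
    if a.via_ebgp != b.via_ebgp:
        return a if a.via_ebgp else b
    if a.igp_cost != b.igp_cost:
        return a if a.igp_cost < b.igp_cost else b
    key_a = (_ip(a.router_id), _ip(a.peer_ip))
    key_b = (_ip(b.router_id), _ip(b.peer_ip))
    return a if key_a <= key_b else b


def best_path(paths: list) -> Path:
    return reduce(better, paths)


customer = Path(as_path=(64501, 64502, 7018), local_pref=200)
transit = Path(as_path=(174, 7018), local_pref=100)
print(best_path([transit, customer]) is customer)  # True: LOCAL_PREF beats length
```

Because MED is only compared between paths from the same neighbour AS, the pairwise result is not guaranteed to be order-independent across three or more candidates; the sketch ignores that subtlety.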
Variants
BGP has accumulated extensions; the ones that matter in 2026:
- BGP-4 with 32-bit AS numbers (RFC 4893, 6793) — the modern baseline. The original BGP-4 used 16-bit AS numbers; that space is exhausted.
- MP-BGP (Multi-Protocol BGP) — extends BGP to carry routes for address families beyond IPv4 (IPv6, MPLS labels, VPN routes). Almost every production BGP deployment uses MP-BGP.
- Route Reflectors — a scaling solution to iBGP’s full-mesh requirement. Instead of every internal router peering with every other, a small set of reflectors collect advertisements and re-distribute them.
- Confederations — alternative scaling, less common than reflectors.
- BGP Communities: tag-based metadata on routes (32-bit values, often interpreted as ASN:value). Used for policy expression: “tag this route as customer-learned”, “tag this route as do-not-export-to-tier-1”, etc.
- RPKI (Resource Public Key Infrastructure): a cryptographic system for validating that an AS is authorised to originate a given prefix. The main defence against origin hijacks; deployment is uneven but growing.
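To make the ASN:value convention for communities concrete, here is a small sketch of how the two 16-bit halves pack into one 32-bit value. The function names are hypothetical. Because each half is 16 bits, a 32-bit AS number does not fit in this format, which the sketch rejects with an error.

```python
def encode_community(text: str) -> int:
    """Pack 'ASN:value' into 32 bits: ASN in the high half, value in the low half."""
    asn_part, value_part = text.split(":")
    asn, value = int(asn_part), int(value_part)
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError(f"each half of {text!r} must fit in 16 bits")
    return (asn << 16) | value


def decode_community(raw: int) -> str:
    """Unpack a 32-bit community back into 'ASN:value' text."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


raw = encode_community("13335:1000")
print(hex(raw))
print(decode_community(raw))  # 13335:1000
```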
Trade-offs
The main trade-off axes:
- Convergence is slow. A path withdrawal can take minutes to propagate fully; intermediate states (where some ASes have the new path and others don’t) are common. The Internet is eventually consistent on routing.
- Policy expressiveness vs operational complexity. Communities + local-pref + filters can express nearly any business relationship; that flexibility makes BGP configurations enormously complex (large ISPs have configuration files in the hundreds of thousands of lines).
- Path length is a weak signal. “Shortest AS path” doesn’t mean “shortest physical path”: one AS hop can be a single-site network or a continent-spanning backbone. Optimising for AS_PATH length alone produces poor latency outcomes.
- No built-in authentication. A router can announce any prefix it wants, and other routers will accept it unless they have an explicit filter or RPKI validation. This is the trust hole.
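The last point is what RPKI-based origin validation tries to close. The sketch below follows the usual three-state outcome (valid, invalid, not-found) under stated assumptions: the `Roa` record and function name are hypothetical, the prefixes and AS numbers come from documentation ranges, and a real validator works from cryptographically verified data rather than a hand-written list.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Roa:
    """Hypothetical authorisation: origin_asn may originate prefix up to max_length."""
    prefix: str
    max_length: int
    origin_asn: int


def validate_origin(prefix: str, as_path: list[int], roas: list[Roa]) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    announced = ipaddress.ip_network(prefix)
    origin = as_path[-1]  # the originating AS is the last entry in AS_PATH
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if announced.version != roa_net.version:
            continue
        if not announced.subnet_of(roa_net):
            continue
        covered = True
        if roa.origin_asn == origin and announced.prefixlen <= roa.max_length:
            return "valid"
    # Covered by some ROA but matched by none: treat as a likely hijack or typo.
    return "invalid" if covered else "not-found"


roas = [Roa(prefix="198.51.100.0/24", max_length=24, origin_asn=64500)]
print(validate_origin("198.51.100.0/24", [64501, 64500], roas))  # valid
print(validate_origin("198.51.100.0/24", [64501, 64511], roas))  # invalid
print(validate_origin("203.0.113.0/24", [64501, 64500], roas))   # not-found
```

Note that this checks only the origin AS; it says nothing about whether the rest of the AS_PATH is genuine.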
Common pitfalls
- Route hijacks. An AS announcing a prefix it doesn’t own — accidentally (a typo’d configuration) or maliciously. The 2008 Pakistan-YouTube incident was an accidental hijack; the 2018 Amazon DNS hijack was malicious. Defence: RPKI + prefix filters at peer edges.
- Route leaks. An AS re-advertising routes it shouldn’t (e.g., re-advertising transit-provider routes back to another transit provider, effectively offering free transit). Cause of multiple historical outages.
- Configuration drift. Production BGP configs accumulate exceptions, special cases, and one-off rules over years. The Facebook 2021 outage shows how far one routine change can reach: a maintenance command meant to assess backbone capacity instead disconnected the backbone, and Facebook’s authoritative DNS servers, unable to reach the data centres, withdrew their own BGP announcements, taking the entire site offline globally for over six hours.
- Slow convergence misread as down. A peer that’s flapping (announcing then withdrawing repeatedly) makes downstream ASes oscillate. The defence is route flap damping, which penalises unstable routes, but damping can also make legitimate changes slow to recover.
- iBGP full-mesh blow-up. Every internal router must peer with every other internal router by default, so n routers need n(n-1)/2 sessions. Past ~30 routers, you need route reflectors. A 50-router full mesh means 1,225 sessions to configure and maintain, and a single missing session leaves part of the AS without some routes.
- No graceful restart. A BGP daemon restart on a busy router can withdraw thousands of routes before the session re-establishes. Graceful Restart extensions exist; not every implementation supports them correctly.
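The full-mesh pitfall above is easy to quantify. The sketch below compares session counts for a full iBGP mesh against a route-reflector design, assuming (for illustration) that every client peers with every reflector and that the reflectors are fully meshed among themselves. The function names are hypothetical.

```python
def full_mesh_sessions(routers: int) -> int:
    """Sessions needed when every router peers with every other: n(n-1)/2."""
    if routers < 0:
        raise ValueError("router count cannot be negative")
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers: int, reflectors: int) -> int:
    """Sessions when clients peer only with reflectors (assumed topology)."""
    if not 0 < reflectors <= routers:
        raise ValueError("need at least one reflector and no more than routers")
    clients = routers - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


print(full_mesh_sessions(30))           # 435
print(full_mesh_sessions(50))           # 1225
print(route_reflector_sessions(50, 2))  # 97
```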
Related building blocks
- Intradomain Routing — OSPF
- Distance-Vector Routing and RIP
- IPv4 — Addressing, Subnets, Fragmentation
- Facebook 2021 — The BGP Withdrawal
The 'why is BGP so weird' question, answered
BGP is weird because it’s not really a routing protocol; it’s a policy protocol that happens to produce routing. The reason it doesn’t use link-state or shortest-path-first like OSPF is that there’s no shared notion of “cost” across ASes; Cogent and Level 3 don’t have the same operational priorities. Path-vector with policy filters lets each AS encode its own definition of “good path” without needing agreement on any global metric. That 1989 design decision has carried thirty-plus years of Internet routing, and it’s why every major Internet outage where you read “BGP” in the headline is downstream of someone’s policy, not the protocol.