Load Balancers — System Design · Engineering Playbook

Use cases#

A load balancer (LB) sits between clients and a pool of identical backends and spreads work across them. Three distinct roles, often confused:

Spread load — fan requests evenly across N replicas so no single instance becomes the bottleneck.
Fail-aware routing — pull unhealthy instances out of rotation automatically; a backend crash should be a non-event for the client.
Decoupling — clients address a single stable name; the fleet behind it scales up, down, deploys, and rolls without coordination with callers.

Most large internet-facing services run at least two tiers of LB: a global one that picks a region and a local one that picks an instance.

Functional requirements#

Accept TCP or HTTP traffic on a virtual IP / hostname.
Distribute connections or requests across a pool of backends per a configurable algorithm.
Run health checks against backends and remove failing ones from rotation.
Support TLS termination (L7) and certificate management.
Expose per-backend metrics (request rate, latency, error rate, in-flight connections).

Non-functional requirements#

Latency overhead: under 1 ms p99 for L4, under 5 ms for L7 with TLS termination. Anything higher and the LB itself starts to consume a noticeable share of the request's latency budget.
Availability: 99.99% minimum. The LB is on the critical path of every request; it must be more available than the service it fronts. Typically achieved with redundant active-active LB instances behind anycast IPs or DNS round-robin.
Throughput: a single modern LB instance (NGINX, HAProxy, Envoy on a dedicated VM) handles 100k-500k QPS of HTTP/1.1, more for HTTP/2. Hardware LBs (F5, A10) and SmartNIC-based LBs push into millions.
Connection scaling: long-lived connections (WebSocket, gRPC streams) need LBs that handle 100k+ concurrent connections per instance.
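
The availability target above can be made concrete with a little arithmetic. The sketch below assumes LB instances fail independently and that failover between them is instantaneous; neither holds exactly in practice, so treat the result as an upper bound. The function names are illustrative only.

```python
def combined_availability(per_instance: float, n: int) -> float:
    """Availability of n redundant instances, ASSUMING independent failures."""
    return 1 - (1 - per_instance) ** n


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime in a 365-day year."""
    return (1 - availability) * 365 * 24 * 60


# Two active-active instances at 99.9% each, under the independence assumption.
pair = combined_availability(0.999, 2)
print(f"{pair:.6f}")
# The 99.99% target leaves under an hour of downtime per year.
print(f"{downtime_minutes_per_year(0.9999):.2f} minutes/year")
```

Correlated failures (a bad config push, a shared network path) are what usually break the independence assumption, which is why the redundancy is typically spread across failure domains.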

High-level design#

                  ┌── region A ──┐
DNS ─> global LB ─┤              ├── local LB ─┬── app-1
        (anycast) └── region B ──┘             ├── app-2
                                               ├── app-3
                                               └── app-N
                                                     │
                                                     └─> DB / cache

Global LB lives at the DNS or anycast layer — it picks a region based on geo, latency, and weighted policies. Local LB lives inside the region — it picks one healthy backend instance per request. The two tiers exist because the algorithms and update frequencies differ: global rebalances in tens of seconds, local rebalances per request.

Detailed design#

L4 vs L7#

L4 (transport) load balancers route by source/destination IP and port. They forward packets without parsing the application protocol. Pros: low overhead (~100 µs), protocol-agnostic, works for any TCP/UDP. Cons: can’t read the request body, can’t route by URL or header.

L7 (application) load balancers parse HTTP (or gRPC, or any framed protocol) and route per-request by path, header, cookie. Pros: smart routing, URL-based traffic splitting, rich observability. Cons: higher latency (~1-5 ms), heavier CPU per byte, must terminate TLS to see plaintext.

A typical edge architecture: L4 LB in front of L7 LB. L4 absorbs DDoS at line rate; L7 handles routing rules behind it.
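
The difference in what each layer can see can be sketched as follows. This is a minimal illustration, not how any particular LB is implemented: the backend addresses, pool names, the `x-canary` header, and both functions are hypothetical.

```python
import hashlib

BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical pool


def pick_l4(src_ip, src_port, dst_ip, dst_port):
    """L4: only the connection tuple is visible, so hash it to pick a backend."""
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(flow).digest()[:8], "big")
    return BACKENDS[digest % len(BACKENDS)]


# Hypothetical L7 routing table: path prefix -> named pool.
ROUTES = [("/api/", "api-pool"), ("/static/", "static-pool")]


def pick_l7(path, headers):
    """L7: the parsed request is visible, so route on a header or the path."""
    if headers.get("x-canary") == "1":
        return "canary-pool"
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return "default-pool"


print(pick_l4("203.0.113.7", 51234, "198.51.100.1", 443))
print(pick_l7("/api/users", {}))
```

Every packet of a flow hashes to the same backend in `pick_l4`, while `pick_l7` can send two requests on the same connection to different pools.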

Algorithms#

Round-robin:       request i goes to backend i % N. Needs only a counter; simple, but ignores backend load.
Random:            uniform draw. Same expected balance as round-robin, and needs no shared counter across LB instances.
Least connections: pick the backend with the fewest in-flight connections. Adapts to slow backends.
Least time:        pick the backend with the lowest recent latency (p95, for example). Requires keeping per-backend stats.
Weighted:          assign per-backend weights (more powerful boxes get more work).
Consistent hash:   hash client_ip or session_id onto a ring of backends, so the same client keeps landing on
                   the same backend while the pool is unchanged. Critical for sticky sessions and cache locality.
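
Three of the policies above fit in a few lines each. The sketch below is a single-process illustration under simplifying assumptions (no locking, no health state); the class and function names are hypothetical rather than taken from any LB.

```python
import itertools
import random


class RoundRobin:
    """Cycle through the backends in order; the iterator is the only state."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Track in-flight counts and pick the least-loaded backend."""

    def __init__(self, backends):
        self.in_flight = {b: 0 for b in backends}

    def acquire(self):
        # Ties go to the backend listed first.
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend):
        # Call when the request or connection finishes.
        self.in_flight[backend] -= 1


def weighted_pick(weights, rng=random):
    """Random draw where a backend with weight 3 is chosen 3x as often as weight 1."""
    backends = list(weights)
    return rng.choices(backends, weights=[weights[b] for b in backends], k=1)[0]


rr = RoundRobin(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(4)])
```

Least connections is the only one of the three that needs a callback when a request ends, which is the extra state the LB pays for adapting to slow backends.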

Consistent hashing#

Naive hash(key) mod N reshuffles almost every key when N changes. Consistent hashing places backends on a ring and assigns each key to the next backend clockwise, so adding or removing a backend moves only about 1/N of the keys on average. The classic application is the Distributed Cache; variants of the same idea are how Maglev (Google), Katran (Facebook), and Envoy implement session-affine load balancing.

Health checks#

Active: LB pings each backend on an interval (e.g. GET /healthz every 5 s). Pulls a backend after 3 consecutive failures.
Passive: LB watches real traffic. If a backend errors on 5 consecutive requests, eject for 30 s.
Outlier detection: Envoy ejects backends that return too many consecutive 5xx responses or whose success rate is a statistical outlier compared with the rest of the cluster.

Health checks should test the dependency graph, not just the process. A backend with GET /healthz returning 200 while its database is unreachable is the worst kind of false-positive.
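
The active and passive rules above reduce to a small amount of per-backend bookkeeping. The sketch below uses the example thresholds from the text (3 failed probes, 5 failed requests, 30 s ejection). It assumes that a single successful probe restores the backend, which is a simplification; many LBs require several successes. The class name and the injectable `clock` are hypothetical.

```python
import time


class BackendHealth:
    """Health state for one backend, combining active and passive checks."""

    def __init__(self, probe_threshold=3, request_threshold=5,
                 eject_seconds=30.0, clock=time.monotonic):
        self.probe_threshold = probe_threshold
        self.request_threshold = request_threshold
        self.eject_seconds = eject_seconds
        self.clock = clock
        self.probe_failures = 0     # consecutive failed active probes
        self.request_errors = 0     # consecutive failed real requests
        self.probe_healthy = True
        self.ejected_until = 0.0

    def record_probe(self, ok):
        """Active check result, e.g. from GET /healthz on an interval."""
        if ok:
            self.probe_failures = 0
            self.probe_healthy = True   # assumption: one success restores
            return
        self.probe_failures += 1
        if self.probe_failures >= self.probe_threshold:
            self.probe_healthy = False

    def record_request(self, ok):
        """Passive check: the outcome of a real proxied request."""
        if ok:
            self.request_errors = 0
            return
        self.request_errors += 1
        if self.request_errors >= self.request_threshold:
            self.ejected_until = self.clock() + self.eject_seconds
            self.request_errors = 0

    def in_rotation(self):
        return self.probe_healthy and self.clock() >= self.ejected_until
```

Passing the clock in keeps the ejection timer testable without sleeping; a real LB would also cap how much of the pool can be ejected at once.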

Sticky sessions#

When a backend caches per-user state in memory (e.g. WebSocket connection, in-progress upload), the LB must route the same user to the same backend. Two approaches:

Cookie-based: the LB injects its own cookie (e.g. AWSALB) or keys on an application cookie (e.g. JSESSIONID); subsequent requests carry it and route accordingly. L7 only.
Hash-based: consistent hash on client_ip or a tenant header. Works at L4 too.
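
The cookie-based approach can be sketched as a single routing function. This is an illustration under simplifying assumptions: the cookie name `lb_backend` is hypothetical, and the backend name is stored in the clear, whereas production LBs typically use an opaque or encrypted value.

```python
import random

COOKIE = "lb_backend"  # hypothetical cookie name


def route(cookies, healthy, rng=random):
    """Return (backend, cookies_to_set) for one request.

    cookies: cookies sent by the client, as a dict
    healthy: list of backends currently in rotation
    """
    pinned = cookies.get(COOKIE)
    if pinned in healthy:
        return pinned, {}               # honor the existing pin
    backend = rng.choice(healthy)       # first visit, or pinned backend is gone
    return backend, {COOKIE: backend}   # LB adds a Set-Cookie to the response


print(route({"lb_backend": "app-2"}, ["app-1", "app-2"]))
```

When the pinned backend leaves rotation the user is re-pinned elsewhere, and any in-memory state on the old backend is lost.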

Stickiness is a smell that the backend isn’t fully stateless. Move session state to Redis if you can.

TLS termination vs passthrough#

Terminating TLS at the LB simplifies certificate management and lets the LB inspect requests (essential for L7 routing). Passthrough (the LB only forwards TCP) preserves end-to-end encryption, which is needed when backends must verify client certificates themselves (as in mTLS service meshes) and under some compliance regimes.

Trade-offs#

Round-robin: uniform when backends are homogeneous and requests are uniform. Breaks when one backend is slow, because the slow backend keeps receiving its full share and requests pile up on it.

Least-connections / least-time: naturally drains traffic away from struggling backends. Costs the LB more state (per-backend in-flight counts) and is harder to scale across multi-instance LB tiers, since each LB instance sees only its own counts.
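
A toy simulation makes the contrast visible. It is a deliberately crude model with made-up numbers: one request arrives per tick, backends serve requests concurrently with a fixed service time, and one backend is roughly 7x slower than the others. All names and parameters are hypothetical.

```python
def simulate(policy, service_ticks, n_requests):
    """Return the peak in-flight count seen on each backend."""
    open_reqs = {b: [] for b in service_ticks}   # finish ticks of open requests
    peak = {b: 0 for b in service_ticks}
    for tick in range(n_requests):
        for b in open_reqs:                      # retire finished requests
            open_reqs[b] = [t for t in open_reqs[b] if t > tick]
        counts = {b: len(v) for b, v in open_reqs.items()}
        chosen = policy(tick, counts)
        open_reqs[chosen].append(tick + service_ticks[chosen])
        peak[chosen] = max(peak[chosen], len(open_reqs[chosen]))
    return peak


def round_robin(tick, counts):
    names = list(counts)
    return names[tick % len(names)]


def least_connections(tick, counts):
    return min(counts, key=counts.get)


service = {"fast-1": 3, "fast-2": 3, "slow": 20}   # ticks per request
rr_peak = simulate(round_robin, service, 60)
lc_peak = simulate(least_connections, service, 60)
print("round-robin peak in-flight:", rr_peak)
print("least-conn  peak in-flight:", lc_peak)
```

Under round-robin the slow backend accumulates open requests because it keeps receiving every third one; under least connections its open requests count against it, so new work flows to the fast backends.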

Other axes:

Active vs passive health checks: active is predictable but lags by up to a few check intervals; passive reacts in real time but can over-eject under spikes.
Local LB vs client-side LB: gRPC and service meshes push the LB into or next to the client process (gRPC's xDS support, Envoy sidecar). Saves one network hop but is more complex to operate.
Hardware vs software: hardware LBs (F5) handle line-rate L4 but cost six figures and lock you into a vendor. Software LBs (HAProxy, NGINX, Envoy) run on commodity hardware and integrate with cloud APIs.

Real-world examples#

AWS ALB is an L7 LB (HTTP/HTTPS/gRPC) with path-based and host-based routing; uses round-robin by default and supports least-outstanding-requests.
AWS NLB is L4, scales to millions of requests per second, preserves source IP, uses flow-hash for stickiness.
Envoy is the data-plane proxy inside Istio, AWS App Mesh, and many other service meshes. Per-request L7 routing, outlier detection, retry policies.
HAProxy has been the workhorse open-source LB for two decades; Stack Overflow famously served its traffic through a handful of HAProxy instances on commodity hardware.
Cloudflare’s load balancer is global, anycast-based, with sub-second failover by withdrawing BGP routes from sick POPs.
Google Maglev routes traffic for many Google properties using consistent-hash-based L4 load balancing across thousands of nodes — described in the 2016 NSDI paper.

DNS — the entry point that picks the LB.
Distributed Cache — uses the same consistent-hash trick the LB does.
Content Delivery Network — its own multi-tier LB and routing layer.
Rate Limiter — often co-located with the L7 LB.