Load Balancers
L4 vs L7, global vs local, algorithms (round-robin, least-connections, consistent-hash), placement tiers.
Use cases#
A load balancer (LB) sits between clients and a pool of identical backends and spreads work across them. Three distinct roles, often confused:
- Spread load — fan requests evenly across N replicas so no single instance becomes the bottleneck.
- Fail-aware routing — pull unhealthy instances out of rotation automatically; a backend crash should be a non-event for the client.
- Decoupling — clients address a single stable name; the fleet behind it scales up, down, deploys, and rolls without coordination with callers.
Most large internet-facing services run at least two tiers of LB: a global one that picks a region and a local one that picks an instance.
Functional requirements#
- Accept TCP or HTTP traffic on a virtual IP / hostname.
- Distribute connections or requests across a pool of backends per a configurable algorithm.
- Run health checks against backends and remove failing ones from rotation.
- Support TLS termination (L7) and certificate management.
- Expose per-backend metrics (request rate, latency, error rate, in-flight connections).
Non-functional requirements#
- Latency overhead: under 1 ms p99 for L4, under 5 ms for L7 with TLS termination. Anything higher and the LB itself becomes the bottleneck.
- Availability: 99.99% minimum. The LB is on the critical path of every request; it must be more available than the service it fronts. Typically achieved with redundant active-active LB instances behind anycast IPs or DNS round-robin.
- Throughput: a single modern LB instance (NGINX, HAProxy, Envoy on a dedicated VM) handles 100k-500k QPS of HTTP/1.1, more for HTTP/2. Hardware LBs (F5, A10) and SmartNIC-based LBs push into millions.
- Connection scaling: long-lived connections (WebSocket, gRPC streams) need LBs that handle 100k+ concurrent connections per instance.
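As a back-of-the-envelope check on the availability bullet, the sketch below computes the availability of an active-active group that stays up as long as any one instance is up. It assumes instances fail independently, which real deployments only approximate (a shared network or a bad config push breaks that assumption). The function name and the 99.9% per-instance figure are illustrative, not measured.

```python
def combined_availability(per_instance: float, replicas: int) -> float:
    """Availability of a group that is up if at least one replica is up.

    Assumes independent failures, which is an idealization.
    """
    return 1.0 - (1.0 - per_instance) ** replicas


MINUTES_PER_YEAR = 365 * 24 * 60

for n in (1, 2, 3):
    a = combined_availability(0.999, n)
    downtime = (1.0 - a) * MINUTES_PER_YEAR
    print(f"{n} replica(s): {a:.7%} available, ~{downtime:.2f} min/year down")
```

Under the independence assumption, two instances at 99.9% each already clear the 99.99% target; correlated failures are what erode that margin in practice.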
High-level design#
                   ┌── region A ──┐
DNS ─> global LB ──┤              ├── local LB ─┬── app-1
       (anycast)   └── region B ──┘             ├── app-2
                                                ├── app-3
                                                └── app-N
                                                      │
                                                      └─> DB / cache

Global LB lives at the DNS or anycast layer: it picks a region based on geo, latency, and weighted policies. Local LB lives inside the region: it picks one healthy backend instance per request. The two tiers exist because the algorithms and update frequencies differ: global rebalances in tens of seconds, local rebalances per request.
Detailed design#
L4 vs L7#
L4 (transport) load balancers route by source/destination IP and port. They forward packets without parsing the application protocol. Pros: low overhead (~100 µs), protocol-agnostic, works for any TCP/UDP. Cons: can’t read the request body, can’t route by URL or header.
L7 (application) load balancers parse HTTP (or gRPC, or any framed protocol) and route per-request by path, header, cookie. Pros: smart routing, URL-based traffic splitting, rich observability. Cons: higher latency (~1-5 ms), heavier CPU per byte, must terminate TLS to see plaintext.
A typical edge architecture: L4 LB in front of L7 LB. L4 absorbs DDoS at line rate; L7 handles routing rules behind it.
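The difference in what each layer can see is easiest to show side by side. The sketch below is a simplification, not how any particular product works: `FiveTuple`, `l4_pick`, `l7_pick`, and the route table are hypothetical names. The L4 function only has addresses and ports to work with, so it hashes the flow; the L7 function has a parsed request, so it can route on the URL path.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FiveTuple:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str  # "tcp" or "udp"


def l4_pick(flow: FiveTuple, backends: List[str]) -> str:
    """L4 view: only the flow's addresses and ports are visible, so hash them."""
    key = f"{flow.src_ip}:{flow.src_port}-{flow.dst_ip}:{flow.dst_port}-{flow.proto}"
    digest = hashlib.sha256(key.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def l7_pick(path: str, routes: Dict[str, List[str]], default: List[str]) -> List[str]:
    """L7 view: the request is parsed, so pick a pool by longest matching path prefix."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        return default
    return routes[max(matches, key=len)]


flow = FiveTuple("203.0.113.7", 51234, "198.51.100.1", 443, "tcp")
routes = {"/api/": ["api-1", "api-2"], "/api/v2/": ["api-v2-1"], "/static/": ["static-1"]}

print(l4_pick(flow, ["10.0.0.1", "10.0.0.2"]))     # same flow always maps to the same backend
print(l7_pick("/api/v2/users", routes, ["web-1"]))  # pool chosen by path prefix
```

Every packet of one connection carries the same five-tuple, so the L4 choice is stable for the life of the flow without the LB parsing anything above the transport layer.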
Algorithms#
- Round-robin: request i goes to backend i % N. Stateless, simple, ignores backend load.
- Random: uniform draw. Same expected balance as round-robin, simpler to scale.
- Least connections: pick backend with fewest in-flight connections. Adapts to slow backends.
- Least time: pick backend with lowest p95 latency. Requires keeping per-backend stats.
- Weighted: assign per-backend weights (more powerful boxes get more work).
- Consistent hash: `hash(client_ip or session_id) mod ring`, so the same client always lands on the same backend. Critical for sticky sessions and cache locality.
Consistent hashing#
Naive `hash(key) mod N` reshuffles almost every key when N changes. Consistent hashing places backends on a ring and assigns each key to the next backend clockwise, so adding or removing a backend moves only about 1/N of the keys on average. Implementations usually place each backend at many points on the ring (virtual nodes) to even out the shares. The classic application is the Distributed Cache; variants of the same idea are also how Maglev (Google), Katran (Facebook), and Envoy implement session-affine load balancing.
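The difference between the two schemes shows up clearly in a small experiment. The following is a minimal ring sketch, not the Maglev or Envoy implementation: `HashRing`, `_point`, the `vnodes=100` default, and the `app-N` / `user-N` names are all assumptions made for illustration. It grows a pool from 10 to 11 backends and measures how many keys change owner under each scheme.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Map a string to a position on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted (point, backend) pairs
        for backend in backends:
            self.add(backend)

    def add(self, backend):
        # Each backend owns several points so its share of keys evens out.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_point(f"{backend}#{i}"), backend))

    def remove(self, backend):
        self._ring = [(p, b) for p, b in self._ring if b != backend]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring point at or clockwise from the key's hash, wrapping at the end.
        i = bisect.bisect_left(self._ring, (_point(key),))
        return self._ring[i % len(self._ring)][1]


def moved_fraction(keys, before, after):
    """Fraction of keys whose owner differs between two lookup functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
old = [f"app-{i}" for i in range(10)]
new = old + ["app-10"]

ring_old, ring_new = HashRing(old), HashRing(new)
ring_moved = moved_fraction(keys, ring_old.lookup, ring_new.lookup)
mod_moved = moved_fraction(
    keys,
    lambda k: old[_point(k) % len(old)],
    lambda k: new[_point(k) % len(new)],
)
print(f"ring moved {ring_moved:.1%} of keys, mod-N moved {mod_moved:.1%}")
```

With these parameters the ring should move roughly one key in eleven, and every moved key lands on the new backend, while mod-N moves roughly ten in eleven. The exact ring figure depends on where the virtual nodes happen to fall.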
Health checks#
- Active: LB pings each backend on an interval (e.g. `GET /healthz` every 5 s). Pulls a backend after 3 consecutive failures.
- Passive: LB watches real traffic. If a backend errors on 5 consecutive requests, eject for 30 s.
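- Outlier detection: Envoy ejects hosts that misbehave relative to their peers, for example after a run of consecutive 5xx responses or when the success rate of one host falls well below the cluster average.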
Health checks should test the dependency graph, not just the process. A backend with GET /healthz returning 200 while its database is unreachable is the worst kind of false-positive.
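The active and passive rules above amount to a small per-backend state machine. The sketch below uses the example thresholds from this section (3 failed probes, 5 failed requests, 30 s ejection); the class and method names are hypothetical, and the clock is passed in so the behavior can be checked without waiting.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackendHealth:
    fail_threshold: int = 3      # active: consecutive failed probes before removal
    error_threshold: int = 5     # passive: consecutive failed requests before ejection
    eject_seconds: float = 30.0  # passive: how long an ejection lasts
    probe_failures: int = 0
    request_errors: int = 0
    ejected_until: float = 0.0

    def record_probe(self, ok: bool) -> None:
        """Active check result, e.g. from a periodic GET /healthz."""
        self.probe_failures = 0 if ok else self.probe_failures + 1

    def record_request(self, ok: bool, now: Optional[float] = None) -> None:
        """Passive signal taken from real traffic."""
        now = time.monotonic() if now is None else now
        self.request_errors = 0 if ok else self.request_errors + 1
        if self.request_errors >= self.error_threshold:
            self.ejected_until = now + self.eject_seconds
            self.request_errors = 0

    def in_rotation(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return self.probe_failures < self.fail_threshold and now >= self.ejected_until


b = BackendHealth()
for _ in range(3):
    b.record_probe(False)
print(b.in_rotation(now=0.0))  # removed after three failed probes
b.record_probe(True)
print(b.in_rotation(now=0.0))  # one good probe restores it in this sketch
```

Restoring a backend after a single good probe is a simplification; production LBs commonly require several consecutive successes before putting a backend back in rotation.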
Sticky sessions#
When a backend caches per-user state in memory (e.g. WebSocket connection, in-progress upload), the LB must route the same user to the same backend. Two approaches:
- Cookie-based: LB injects its own cookie (e.g. `AWSALB`) or keys on an application cookie (e.g. `JSESSIONID`); subsequent requests carry it and route accordingly. L7 only.
- Hash-based: consistent hash on `client_ip` or a tenant header. Works at L4 too.
Stickiness is a smell that the backend isn’t fully stateless. Move session state to Redis if you can.
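For cases where stickiness is unavoidable, the cookie-based approach can be sketched as follows. This is an illustration under assumptions: the cookie name `lb_backend` and the `route` function are hypothetical, and the cookie stores the backend name in plain text, whereas real LBs typically encode or sign the value.

```python
import random
from http.cookies import SimpleCookie
from typing import List, Optional, Tuple

COOKIE_NAME = "lb_backend"  # hypothetical cookie name


def route(cookie_header: str, healthy: List[str], rng=random) -> Tuple[str, Optional[str]]:
    """Return (backend, Set-Cookie value or None if the existing pin is still valid)."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    pinned = cookie[COOKIE_NAME].value if COOKIE_NAME in cookie else None
    if pinned in healthy:
        return pinned, None  # honor the existing pin
    # No pin, or the pinned backend left rotation: choose again and re-pin.
    chosen = rng.choice(healthy)
    return chosen, f"{COOKIE_NAME}={chosen}; Path=/; HttpOnly"


healthy = ["app-1", "app-2", "app-3"]
print(route("lb_backend=app-2; theme=dark", healthy))  # pinned backend is reused
print(route("lb_backend=app-9", healthy))              # stale pin, so a new backend is chosen
```

The second call shows the cost of stickiness: when the pinned backend disappears, whatever in-memory state it held for that user is gone, which is the argument for moving that state to a shared store.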
TLS termination vs passthrough#
Terminating TLS at the LB simplifies certificate management and lets the LB inspect requests (essential for L7 routing). Passthrough (the LB only forwards TCP) preserves end-to-end encryption — required for mTLS service meshes and some compliance regimes.
Trade-offs#
Other axes:
- Active vs passive health checks: active probing is predictable but lags (detection can take up to the probe interval times the failure threshold); passive checking reacts in real time but can over-eject under error spikes.
- Local LB vs client-side LB: gRPC and service meshes push load balancing into or next to the client process (Envoy sidecar, gRPC with xDS). This saves one network hop but is more complex to operate.
- Hardware vs software: hardware LBs (F5) handle line-rate L4 but cost six figures and lock you into a vendor. Software LBs (HAProxy, NGINX, Envoy) run on commodity hardware and integrate with cloud APIs.
Real-world examples#
- AWS ALB is an L7 LB (HTTP/HTTPS/gRPC) with path-based and host-based routing; uses round-robin by default and supports least-outstanding-requests.
- AWS NLB is L4, scales to millions of requests per second, preserves source IP, and uses a flow hash to keep each connection on one target.
- Envoy is the data-plane proxy inside Istio, AWS App Mesh, and many other service meshes. It provides per-request L7 routing, outlier detection, and retry policies.
- HAProxy has been the workhorse open-source LB for two decades — Stack Overflow famously served all its traffic through two HAProxy instances on commodity hardware.
- Cloudflare’s load balancer is global, anycast-based, with sub-second failover by withdrawing BGP routes from sick POPs.
- Google Maglev routes traffic for many Google properties using consistent-hash-based L4 load balancing across thousands of nodes — described in the 2016 NSDI paper.
Related building blocks#
- DNS — the entry point that picks the LB.
- Distributed Cache — uses the same consistent-hash trick the LB does.
- Content Delivery Network — its own multi-tier LB and routing layer.
- Rate Limiter — often co-located with the L7 LB.