Load Balancers

L4 vs L7, global vs local, algorithms (round-robin, least-connections, consistent-hash), placement tiers.

Building Block Foundational
6 min read
load-balancing networking traffic-routing
Companies this resembles: NGINX · HAProxy · AWS ELB · Envoy · Cloudflare

Use cases

A load balancer (LB) sits between clients and a pool of identical backends and spreads work across them. Three distinct roles, often confused:

  • Spread load — fan requests evenly across N replicas so no single instance becomes the bottleneck.
  • Fail-aware routing — pull unhealthy instances out of rotation automatically; a backend crash should be a non-event for the client.
  • Decoupling — clients address a single stable name; the fleet behind it scales up, down, deploys, and rolls without coordination with callers.

Most large internet-facing services run at least two tiers of LB: a global one that picks a region and a local one that picks an instance.

Functional requirements

  • Accept TCP or HTTP traffic on a virtual IP / hostname.
  • Distribute connections or requests across a pool of backends per a configurable algorithm.
  • Run health checks against backends and remove failing ones from rotation.
  • Support TLS termination (L7) and certificate management.
  • Expose per-backend metrics (request rate, latency, error rate, in-flight connections).

Non-functional requirements

  • Latency overhead: under 1 ms p99 for L4, under 5 ms for L7 with TLS termination. Anything higher and the LB starts to consume a noticeable share of the request’s latency budget.
  • Availability: 99.99% minimum. The LB is on the critical path of every request; it must be more available than the service it fronts. Typically achieved with redundant active-active LB instances behind anycast IPs or DNS round-robin.
  • Throughput: a single modern LB instance (NGINX, HAProxy, Envoy on a dedicated VM) handles on the order of 100k-500k QPS of HTTP/1.1, depending on core count, request size, and TLS; HTTP/2 multiplexing can push this higher. Hardware LBs (F5, A10) and SmartNIC-based LBs push into the millions.
  • Connection scaling: long-lived connections (WebSocket, gRPC streams) need LBs that handle 100k+ concurrent connections per instance.
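The availability target above can be sanity-checked with quick arithmetic. This is a minimal sketch that assumes LB instances fail independently, which is optimistic in practice; the figures `0.999` and `0.9995` are illustrative assumptions, not measurements.

```python
def serial(*availabilities: float) -> float:
    """Availability of components in series: every one must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant(a: float, n: int) -> float:
    """Availability of n independent active-active replicas: at least one is up."""
    return 1.0 - (1.0 - a) ** n


lb_single = 0.999                   # assumed availability of one LB instance
lb_pair = redundant(lb_single, 2)   # 1 - 0.001**2
service = 0.9995                    # assumed availability of the backend fleet

print(f"pair of LBs: {lb_pair:.6f}")
print(f"end to end:  {serial(lb_pair, service):.6f}")
```

Because the LB sits in series with the service, the end-to-end figure can never exceed the smaller of the two, which is why the LB tier needs the redundancy.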

High-level design

                  ┌── region A ──┐
DNS ─> global LB ─┤              ├── local LB ─┬── app-1
       (anycast)  └── region B ──┘             ├── app-2
                                               ├── app-3
                                               └── app-N
                                                     └─> DB / cache

Global LB lives at the DNS or anycast layer — it picks a region based on geo, latency, and weighted policies. Local LB lives inside the region — it picks one healthy backend instance per request. The two tiers exist because the algorithms and update frequencies differ: global rebalances in tens of seconds, local rebalances per request.
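The two-tier split can be sketched as two separate selection functions. This is an illustrative model only: the `Region` type, the latency-divided-by-weight scoring rule, and all names are assumptions made for the example, not how any particular global LB scores regions.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    latency_ms: float          # assumed client-to-region latency measurement
    weight: float = 1.0        # operator-set weight; 0 drains the region
    backends: dict = field(default_factory=dict)  # backend name -> healthy?


def pick_region(regions):
    """Global tier: choose among regions that are weighted in and have capacity."""
    candidates = [r for r in regions if r.weight > 0 and any(r.backends.values())]
    if not candidates:
        raise RuntimeError("no region has a healthy backend")
    # Lower latency wins; a higher weight makes a region look "closer".
    return min(candidates, key=lambda r: r.latency_ms / r.weight)


def pick_backend(region, rng):
    """Local tier: choose one healthy instance per request."""
    healthy = [name for name, ok in region.backends.items() if ok]
    return rng.choice(healthy)


regions = [
    Region("region-a", latency_ms=20, backends={"app-1": True, "app-2": False}),
    Region("region-b", latency_ms=80, backends={"app-3": True}),
]
rng = random.Random(0)
region = pick_region(regions)
print(region.name, pick_backend(region, rng))
```

In a real deployment `pick_region` would run rarely (its inputs change over seconds), while `pick_backend` runs on every request.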

Detailed design

L4 vs L7

L4 (transport) load balancers route by source/destination IP and port. They forward packets without parsing the application protocol. Pros: low overhead (~100 µs), protocol-agnostic, works for any TCP/UDP. Cons: can’t read the request body, can’t route by URL or header.

L7 (application) load balancers parse HTTP (or gRPC, or any framed protocol) and route per-request by path, header, cookie. Pros: smart routing, URL-based traffic splitting, rich observability. Cons: higher latency (~1-5 ms), heavier CPU per byte, must terminate TLS to see plaintext.

A typical edge architecture: L4 LB in front of L7 LB. L4 absorbs DDoS at line rate; L7 handles routing rules behind it.
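The difference comes down to what each layer can see when it makes a decision. The sketch below is a simplified illustration, not a packet-level implementation: `l4_pick` only has the connection 5-tuple, while `l7_pick` has a parsed request path. The function names, route table, and pool names are hypothetical.

```python
import hashlib


def l4_pick(src_ip, src_port, dst_ip, dst_port, proto, backends):
    """L4: only the connection 5-tuple is visible, so hash it to a backend."""
    key = f"{proto}|{src_ip}:{src_port}|{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def l7_pick(path, routes, default_pool):
    """L7: the request is parsed, so route on the longest matching path prefix."""
    for prefix in sorted(routes, key=len, reverse=True):
        if path.startswith(prefix):
            return routes[prefix]
    return default_pool


backends = ["app-1", "app-2", "app-3"]
print(l4_pick("203.0.113.7", 51000, "198.51.100.1", 443, "tcp", backends))

routes = {"/api": "api-pool", "/api/v2": "api-v2-pool"}
print(l7_pick("/api/v2/users", routes, "web-pool"))
```

Every packet of a given connection hashes to the same backend at L4, whereas at L7 two requests on the same connection can go to different pools.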

Algorithms

Round-robin — request i goes to backend i % N. Needs only a counter, simple, ignores backend load.
Random — uniform draw. Same expected balance as round-robin, and with no shared counter it is simpler to run across many LB instances.
Least connections — pick backend with fewest in-flight connections. Adapts to slow backends.
Least time — pick backend with lowest p95 latency. Requires keeping per-backend stats.
Weighted — assign per-backend weights (more powerful boxes get more work).
Consistent hash — `hash(client_ip or session_id)` mapped onto a hash ring, so the same client always lands on the same backend. Critical for sticky sessions and cache locality.
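A minimal sketch of three of the policies listed above, to make the state each one needs concrete. The class and function names are illustrative assumptions; a production LB would also need locking, health awareness, and connection tracking.

```python
import itertools
import random


class RoundRobin:
    """Request i goes to backend i % N; the only state is a counter."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._counter = itertools.count()

    def pick(self):
        return self._backends[next(self._counter) % len(self._backends)]


class LeastConnections:
    """Pick the backend with the fewest in-flight requests (ties: first listed)."""

    def __init__(self, backends):
        self.in_flight = {b: 0 for b in backends}

    def acquire(self):
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend):
        self.in_flight[backend] -= 1


def weighted_pick(weights, rng):
    """Weighted random: a backend with weight 3 gets about 3x the traffic of weight 1."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rr = RoundRobin(["a", "b", "c"])
print([rr.pick() for _ in range(4)])

lc = LeastConnections(["a", "b", "c"])
print([lc.acquire() for _ in range(3)])
lc.release("b")
print(lc.acquire())
```

Note that least-connections is the only one of the three that needs a callback when a request finishes, which is the extra state the trade-offs section refers to.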

Consistent hashing

Naive hash(key) mod N reshuffles almost every key when N changes. Consistent hashing places backends on a ring and assigns each key to the next backend clockwise, so adding or removing a backend moves only about 1/N of the keys on average. The classic application is the Distributed Cache; variants of the same idea also underpin session-affine load balancing in Maglev (Google), Katran (Facebook), and Envoy.
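The ring can be sketched in a few dozen lines. This is a textbook ring with virtual nodes, offered as an illustration only; it is not the table-based scheme Maglev uses, and the `HashRing` name, the `vnodes=100` default, and the SHA-256 hash are assumptions chosen for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted list of (position on ring, backend)
        for backend in backends:
            self.add(backend)

    def add(self, backend):
        # Each backend owns many points so its share of the ring is close to 1/N.
        for i in range(self._vnodes):
            bisect.insort(self._points, (_hash(f"{backend}#{i}"), backend))

    def remove(self, backend):
        self._points = [p for p in self._points if p[1] != backend]

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First point at or after the key's position, wrapping past the end.
        idx = bisect.bisect_left(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]


keys = [f"user-{i}" for i in range(10_000)]
old = ["app-1", "app-2", "app-3", "app-4"]
new = old + ["app-5"]

# Naive modulo: compare assignments before and after adding a fifth backend.
naive_moved = sum(old[_hash(k) % 4] != new[_hash(k) % 5] for k in keys)

ring = HashRing(old)
before = {k: ring.lookup(k) for k in keys}
ring.add("app-5")
ring_moved = sum(before[k] != ring.lookup(k) for k in keys)

print(f"naive mod N moved {naive_moved} keys, ring moved {ring_moved}")
```

The useful property is that every key that moves on the ring moves to the new backend; no key is shuffled between the existing ones.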

Health checks

  • Active: LB pings each backend on an interval (e.g. GET /healthz every 5 s). Pulls a backend after 3 consecutive failures.
  • Passive: LB watches real traffic. If a backend errors on 5 consecutive requests, eject for 30 s.
  • Outlier detection: Envoy ejects backends that return too many consecutive 5xx responses or whose success rate falls well below the rest of the cluster.

Health checks should test the dependency graph, not just the process. A backend whose GET /healthz returns 200 while its database is unreachable looks healthy but fails every real request, which is the worst kind of false positive.
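The active and passive rules above can be modelled as a small per-backend state tracker. This is a sketch using the example thresholds from the list (3 failed probes, 5 failed requests, 30 s ejection); the `BackendHealth` class and its field names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    probe_threshold: int = 3      # active: consecutive failed probes before removal
    request_threshold: int = 5    # passive: consecutive failed real requests
    eject_seconds: float = 30.0   # passive: how long an ejection lasts
    probe_failures: int = 0
    request_failures: int = 0
    ejected_until: float = 0.0

    def record_probe(self, ok: bool) -> None:
        # Simplification: one good probe restores the backend immediately.
        # Real LBs often require several consecutive successes.
        self.probe_failures = 0 if ok else self.probe_failures + 1

    def record_request(self, ok: bool, now: float) -> None:
        if ok:
            self.request_failures = 0
            return
        self.request_failures += 1
        if self.request_failures >= self.request_threshold:
            self.ejected_until = now + self.eject_seconds
            self.request_failures = 0

    def in_rotation(self, now: float) -> bool:
        probes_ok = self.probe_failures < self.probe_threshold
        return probes_ok and now >= self.ejected_until


health = BackendHealth()
for _ in range(3):
    health.record_probe(ok=False)
print(health.in_rotation(now=0.0))   # removed after three failed probes
health.record_probe(ok=True)
print(health.in_rotation(now=0.0))   # back in rotation
```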

Sticky sessions

When a backend caches per-user state in memory (e.g. WebSocket connection, in-progress upload), the LB must route the same user to the same backend. Two approaches:

  • Cookie-based: LB injects its own cookie (e.g. AWSALB) or keys on an application cookie (e.g. JSESSIONID); subsequent requests carry it and route accordingly. L7 only.
  • Hash-based: consistent hash on client_ip or a tenant header. Works at L4 too.

Stickiness is a smell that the backend isn’t fully stateless. Move session state to Redis if you can.
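Cookie-based stickiness can be sketched as a single routing function. The cookie name `lb_backend` and the `route` signature are hypothetical, and the cookie carries the backend name in plain text only to keep the example short; real LBs typically use an opaque or signed value.

```python
import random

COOKIE = "lb_backend"  # hypothetical cookie name


def route(cookies, healthy, rng):
    """Return (backend, cookies_to_set) for one request."""
    pinned = cookies.get(COOKIE)
    if pinned in healthy:
        # Existing session and its backend is still healthy: honour the pin.
        return pinned, {}
    # New client, or the pinned backend left rotation: pick again and re-pin.
    backend = rng.choice(healthy)
    return backend, {COOKIE: backend}


rng = random.Random(0)
healthy = ["app-1", "app-2", "app-3"]

backend, set_cookies = route({}, healthy, rng)        # first request: gets pinned
again, _ = route(set_cookies, healthy, rng)           # follow-up: same backend
print(backend == again)
```

When the pinned backend fails, the user is silently re-pinned and any in-memory state on the old backend is lost, which is the cost the paragraph above is warning about.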

TLS termination vs passthrough

Terminating TLS at the LB simplifies certificate management and lets the LB inspect requests (essential for L7 routing). Passthrough (the LB only forwards TCP) preserves end-to-end encryption — required for mTLS service meshes and some compliance regimes.

Trade-offs

Round-robin — uniform when backends are homogeneous and requests are uniform. Breaks when one backend is slow: it keeps receiving its full share of requests, so work piles up on the stuck one.
Least-connections / least-time — naturally drains traffic away from struggling backends. Costs the LB more state (per-backend in-flight counts) and is harder to scale across multi-instance LB tiers.
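A toy simulation makes the difference visible. It is a deliberately simple model with assumed numbers: one request arrives per tick, backends serve requests concurrently, and one backend takes ten times longer than the others. The `simulate` function and the service times are illustrative, not a benchmark.

```python
def simulate(policy, service_time, arrivals=300):
    """One request arrives per tick; return the peak in-flight count per backend."""
    names = list(service_time)
    finish = {n: [] for n in names}   # finish times of requests still in flight
    peak = {n: 0 for n in names}
    for t in range(arrivals):
        for n in names:               # drop requests that have completed by now
            finish[n] = [f for f in finish[n] if f > t]
        if policy == "round_robin":
            target = names[t % len(names)]
        elif policy == "least_connections":
            target = min(names, key=lambda n: len(finish[n]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        finish[target].append(t + service_time[target])
        peak[target] = max(peak[target], len(finish[target]))
    return peak


service_time = {"fast-1": 2, "fast-2": 2, "slow": 20}  # ticks per request (assumed)
rr = simulate("round_robin", service_time)
lc = simulate("least_connections", service_time)
print("round-robin peak in-flight:      ", rr)
print("least-connections peak in-flight:", lc)
```

In this model the two fast backends can absorb the whole load, so least-connections steers essentially everything away from the slow one, while round-robin keeps feeding it a third of the traffic.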

Other axes:

  • Active vs passive health checks — active is predictable but lags by up to a few probe intervals; passive reacts in real time but can over-eject under spikes.
  • Local LB vs client-side LB — gRPC and service meshes push the LB into the client process or a co-located sidecar (gRPC’s xDS support, Envoy sidecar). Saves one network hop but is more complex to operate.
  • Hardware vs software — hardware LBs (F5) handle line-rate L4 but cost six figures and lock you into a vendor. Software LBs (HAProxy, NGINX, Envoy) run on commodity hardware and integrate with cloud APIs.

Real-world examples

  • AWS ALB is an L7 LB (HTTP/HTTPS/gRPC) with path-based and host-based routing; uses round-robin by default and supports least-outstanding-requests.
  • AWS NLB is L4, scales to millions of requests per second, preserves source IP, uses flow-hash for stickiness.
  • Envoy is the data-plane proxy inside Istio, AWS App Mesh, and many other service meshes. Per-request L7 routing, outlier detection, retry policies.
  • HAProxy has been the workhorse open-source LB for two decades — Stack Overflow famously served all its traffic through two HAProxy instances on commodity hardware.
  • Cloudflare’s load balancer is global, anycast-based, with sub-second failover by withdrawing BGP routes from sick POPs.
  • Google Maglev routes traffic for many Google properties using consistent-hash-based L4 load balancing across thousands of nodes — described in the 2016 NSDI paper.