TCP Congestion Control — AIMD, Slow Start, Fast Retransmit

The four canonical algorithms (slow start, congestion avoidance, fast retransmit, fast recovery) and why AIMD converges to fairness.

Building Block Intermediate
9 min read
tcp congestion-control aimd slow-start cubic bbr

What it is#

TCP congestion control is the sender’s mechanism for not overwhelming the network — distinct from flow control, which protects the receiver. The sender maintains a congestion window cwnd, in bytes, alongside the receive window rwnd; the effective send window is min(rwnd, cwnd). cwnd ramps up when packets are getting through and shrinks when loss appears. The dance of growth and contraction is what keeps the Internet from collapsing.

Four canonical algorithms, layered: slow start (probe how fast the path can carry traffic), congestion avoidance (linear additive increase past a threshold), fast retransmit (resend on three duplicate ACKs without waiting for the timer), fast recovery (don’t drop cwnd all the way to 1 after a fast retransmit — halve it and continue). Together they form Reno-style AIMD: Additive Increase, Multiplicative Decrease. Modern Linux defaults to CUBIC, which keeps the shape but uses a cubic function instead of linear growth. Google’s BBR replaces the loss-based model entirely with explicit bandwidth and RTT measurement.

When to use it#

Congestion control is always on; you don’t disable it. What you choose:

  • The algorithm. Linux exposes net.ipv4.tcp_congestion_controlcubic (default), bbr, reno, vegas, others. Pick bbr for high-throughput long-fat paths (Google’s edge, CDN cross-region); cubic is fine for most things.
  • Initial congestion window. RFC 6928 raised the default to 10 segments. CDN edges sometimes push it higher to skip slow start for short HTTP responses. Trade-off: bursty pacing risks tail drops at routers with shallow queues.
  • Pacing. Linux can pace cwnd over the RTT (sk_pacing_rate) instead of bursting. BBR and Fair-Queue scheduling depend on this; CUBIC tolerates either.

When to actually think about it:

  • Long-fat networks (high bandwidth × high RTT). Single-flow throughput is limited by cwnd / RTT. CUBIC reaches a 1 GB window slowly; BBR’s bandwidth-probing model fills the pipe faster.
  • Lossy wireless paths. Reno/CUBIC interpret loss as congestion and back off. On Wi-Fi or LTE, loss is often link-layer noise, not router queues — backing off costs throughput. BBR (loss-tolerant) and link-layer ARQ both help.
  • Cross-region replication, backups, video. Anywhere a single flow needs to push tens of MB/s for minutes. Algorithm choice matters.

How it works#

Slow start#

A new connection has no idea what the path can handle. Start cwnd = 10 * MSS (recent default); double cwnd every RTT (one additional MSS for every ACK received) until either a loss occurs or cwnd reaches ssthresh (slow-start threshold).

cwnd over RTTs during slow start:
RTT 1: 10
RTT 2: 20
RTT 3: 40
RTT 4: 80
...

Despite the name, slow start is exponential — fast in the steady state. “Slow” is relative to the historical alternative of opening with a huge cwnd and overrunning the path.

Congestion avoidance#

Once cwnd >= ssthresh, switch to linear growth — add one MSS per RTT (one MSS / cwnd ACK in segments). This is the “additive increase” of AIMD. The sender now grows cautiously, probing for capacity rather than rushing.

Loss detection and reaction#

Two ways to detect loss:

  1. Retransmission timeout (RTO). The catastrophic case. Set ssthresh = cwnd / 2, reset cwnd = 1 * MSS, return to slow start. Costly — every connection that experiences an RTO loses minutes to ramp back up.
  2. Three duplicate ACKs (fast retransmit). The lighter touch. Set ssthresh = cwnd / 2, set cwnd = ssthresh + 3 * MSS (count the three dup-ACKs as packets still in flight), retransmit the missing segment, enter fast recovery. After the lost segment is acked, cwnd = ssthresh and resume congestion avoidance.

This pair — halve on duplicate ACK, drop to 1 on RTO — is the “multiplicative decrease” of AIMD.

The AIMD sawtooth#

cwnd
|
| /\ /\ /\
| / \ / \ / \
| / \ / \ / \
| / \ / \ / \
| / \ / \ / \
| / \/ \/ \
| /
|---+----------------------------------------> time
slow start cong. avoidance loss → halve

Linear up, halve on loss. The shape repeats; the sender lives at roughly 75% of bottleneck capacity on average.

Why AIMD converges to fairness#

Two flows sharing a bottleneck: each grows additively when there’s no loss; both halve when loss occurs. Plot one flow’s cwnd against the other’s. Additive increase moves the point along a 45° line; multiplicative decrease moves it along a line through the origin. The intersection of those two motions converges to the fair-share point — equal cwnd for both flows. This is the classical Chiu-Jain result.

flow 2 cwnd
^
| . . . . . fairness line (cwnd1 == cwnd2)
| / ↗
| / ↗
| / ↗ ← additive increase: equal slope, parallel to 45°
|/ ↘ ← multiplicative decrease: toward origin
+----------> flow 1 cwnd

Multiplicative-increase / additive-decrease would diverge. Additive-increase / multiplicative-decrease converges. That’s the deep reason TCP uses AIMD specifically.

Fast retransmit and fast recovery#

Loss recovery used to be “wait for the RTO, retransmit, restart slow-start.” Fast retransmit (Reno) added: if three duplicate ACKs arrive, retransmit immediately — the receiver got later segments but not this one, so it’s almost certainly lost.

Fast recovery (NewReno, later default) refined the reaction: don’t drop cwnd to 1. Halve ssthresh, set cwnd = ssthresh + 3 to account for the dup-ACKs, retransmit, and stay in congestion avoidance. Far less throughput hit than a full RTO recovery.

Variants#

  • TCP Reno — original AIMD with fast retransmit / fast recovery. Reference implementation; rarely the default today.
  • TCP NewReno (RFC 6582) — fixes Reno’s behaviour when multiple segments are lost in one window. Still cumulative-ACK only.
  • TCP CUBIC — Linux default since 2.6.19. cwnd grows as a cubic function of time since the last loss. Plateaus near the previous high-water mark, then probes past it. Better than Reno for high-bandwidth-delay-product paths.
  • TCP BBR (Bottleneck Bandwidth and Round-trip propagation time) — Google’s loss-independent algorithm. Estimates bottleneck bandwidth (max delivery rate seen) and minimum RTT, sets cwnd to keep the pipe full without overflowing the bottleneck queue. Now widespread on Google services and YouTube.
  • TCP Vegas — RTT-based. Senses queueing by RTT inflation, not by loss. Falls behind in mixed deployments with loss-based flows (which take more share).
  • DCTCP (Data Center TCP) — for datacenter switches with ECN marking. Uses ECN as a signal to ramp cwnd down proportionally rather than halving. Lower latency in datacenter fabrics.
  • TCP Westwood / Westwood+ — wireless-friendly; estimates available bandwidth from ACK rate and uses that to set ssthresh after loss.

Trade-offs#

CUBIC — loss-based; widely deployed; fair to other loss-based flows. Slow to fill long-fat networks; spurious back-off on wireless loss; gets pushed around by aggressive flows.
BBR — model-based; fills the pipe faster; tolerates random loss; doesn’t depend on router drop signals. Can be unfair to CUBIC in mixed deployments (BBR’s bandwidth model lets it out-throughput loss-based competitors); BBRv1 caused queue buildup in some scenarios — BBRv2 / v3 address this.

Other tensions:

  • Aggressiveness vs fairness. A more aggressive algorithm gets more throughput in mixed deployments — at the cost of unfairness. CUBIC was a measured step past Reno; BBR is a step further. Standards bodies care about “TCP-friendly” behaviour.
  • Loss-based vs delay-based vs model-based. Loss-based (Reno, CUBIC): packet drops are the signal. Delay-based (Vegas): RTT inflation. Model-based (BBR): direct bandwidth + RTT measurement. Each has different deployment trade-offs.
  • Initial cwnd. Larger IW skips slow start for short responses — wins on HTTP. Too large and a burst from a CDN edge overruns shallow router queues, causing tail drops that hurt page-load time. RFC 6928’s 10 is the current consensus.
  • Pacing vs bursting. A bursty sender transmits its window worth of segments back-to-back. A pacing sender spreads them over the RTT. Pacing reduces microburst loss; some legacy NICs and middleboxes don’t behave well under it.
Why exactly does multiplicative decrease cause convergence?

When you halve a flow’s cwnd, you halve its share of the bottleneck. Two unequal flows that halve simultaneously get closer in absolute terms — (20, 10) becomes (10, 5), ratio preserved, but absolute gap halved. Repeated halving with shared loss events plus equal additive increase shrinks the gap toward zero. Chiu and Jain’s 1989 paper formalised this with a vector argument: AIMD’s trajectory in (flow1, flow2) space monotonically approaches the fairness line.

Common pitfalls#

  • Treating cwnd and rwnd as one thing. They are independent. min(cwnd, rwnd) is the send window. Confusing them means confusing receiver-driven backpressure with network-driven backoff.
  • Assuming loss means congestion. On wireless and lossy paths, link-layer noise causes loss without router queue buildup. Loss-based algorithms back off unnecessarily. BBR and ECN address this.
  • Ignoring slow start on short connections. A new TCP connection to a remote service may spend its entire life in slow start. The first ten round-trips of a 100ms-RTT path are 1 second of ramp-up. Keep-alive, HTTP/2, and connection pooling exist to amortise this.
  • Tuning tcp_congestion_control cargo-cult-style. Switching to BBR without measuring can hurt other flows on the path. Profile before changing.
  • Forgetting that congestion control is per-flow. Two TCP connections from the same client open two cwnd ramps — twice the bandwidth share of a single TCP-fair flow. HTTP/1.1’s six parallel connections per origin exploited this; HTTP/2 collapses it back to one (and so reverts to one TCP-fair share).
  • Not pacing high-bandwidth bursts. A 10 Gbps-capable host bursting its cwnd instantly can overrun shallow ingress queues at the bottleneck. Pacing (Linux fq qdisc, BBR’s built-in pacing) smooths this.
  • Confusing fast retransmit with fast recovery. Fast retransmit is “resend on 3 dup-ACKs.” Fast recovery is “after a fast retransmit, don’t drop cwnd to 1, halve it and continue.” Both are needed; they’re separate refinements.
Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.