TCP Congestion Control — AIMD, Slow Start, Fast Retransmit
The four canonical algorithms (slow start, congestion avoidance, fast retransmit, fast recovery) and why AIMD converges to fairness.
What it is#
TCP congestion control is the sender’s mechanism for not overwhelming the network — distinct from flow control, which protects the receiver. The sender maintains a congestion window cwnd, in bytes, alongside the receive window rwnd; the effective send window is min(rwnd, cwnd). cwnd ramps up when packets are getting through and shrinks when loss appears. The dance of growth and contraction is what keeps the Internet from collapsing.
Four canonical algorithms, layered: slow start (probe how fast the path can carry traffic), congestion avoidance (linear additive increase past a threshold), fast retransmit (resend on three duplicate ACKs without waiting for the timer), fast recovery (don’t drop cwnd all the way to 1 after a fast retransmit — halve it and continue). Together they form Reno-style AIMD: Additive Increase, Multiplicative Decrease. Modern Linux defaults to CUBIC, which keeps the shape but uses a cubic function instead of linear growth. Google’s BBR replaces the loss-based model entirely with explicit bandwidth and RTT measurement.
When to use it#
Congestion control is always on; you don’t disable it. What you choose:
- The algorithm. Linux exposes
net.ipv4.tcp_congestion_control—cubic(default),bbr,reno,vegas, others. Pickbbrfor high-throughput long-fat paths (Google’s edge, CDN cross-region);cubicis fine for most things. - Initial congestion window. RFC 6928 raised the default to 10 segments. CDN edges sometimes push it higher to skip slow start for short HTTP responses. Trade-off: bursty pacing risks tail drops at routers with shallow queues.
- Pacing. Linux can pace
cwndover the RTT (sk_pacing_rate) instead of bursting. BBR and Fair-Queue scheduling depend on this; CUBIC tolerates either.
When to actually think about it:
- Long-fat networks (high bandwidth × high RTT). Single-flow throughput is limited by
cwnd / RTT. CUBIC reaches a 1 GB window slowly; BBR’s bandwidth-probing model fills the pipe faster. - Lossy wireless paths. Reno/CUBIC interpret loss as congestion and back off. On Wi-Fi or LTE, loss is often link-layer noise, not router queues — backing off costs throughput. BBR (loss-tolerant) and link-layer ARQ both help.
- Cross-region replication, backups, video. Anywhere a single flow needs to push tens of MB/s for minutes. Algorithm choice matters.
How it works#
Slow start#
A new connection has no idea what the path can handle. Start cwnd = 10 * MSS (recent default); double cwnd every RTT (one additional MSS for every ACK received) until either a loss occurs or cwnd reaches ssthresh (slow-start threshold).
cwnd over RTTs during slow start:RTT 1: 10RTT 2: 20RTT 3: 40RTT 4: 80...Despite the name, slow start is exponential — fast in the steady state. “Slow” is relative to the historical alternative of opening with a huge cwnd and overrunning the path.
Congestion avoidance#
Once cwnd >= ssthresh, switch to linear growth — add one MSS per RTT (one MSS / cwnd ACK in segments). This is the “additive increase” of AIMD. The sender now grows cautiously, probing for capacity rather than rushing.
Loss detection and reaction#
Two ways to detect loss:
- Retransmission timeout (RTO). The catastrophic case. Set
ssthresh = cwnd / 2, resetcwnd = 1 * MSS, return to slow start. Costly — every connection that experiences an RTO loses minutes to ramp back up. - Three duplicate ACKs (fast retransmit). The lighter touch. Set
ssthresh = cwnd / 2, setcwnd = ssthresh + 3 * MSS(count the three dup-ACKs as packets still in flight), retransmit the missing segment, enter fast recovery. After the lost segment is acked,cwnd = ssthreshand resume congestion avoidance.
This pair — halve on duplicate ACK, drop to 1 on RTO — is the “multiplicative decrease” of AIMD.
The AIMD sawtooth#
cwnd | | /\ /\ /\ | / \ / \ / \ | / \ / \ / \ | / \ / \ / \ | / \ / \ / \ | / \/ \/ \ | / |---+----------------------------------------> time slow start cong. avoidance loss → halveLinear up, halve on loss. The shape repeats; the sender lives at roughly 75% of bottleneck capacity on average.
Why AIMD converges to fairness#
Two flows sharing a bottleneck: each grows additively when there’s no loss; both halve when loss occurs. Plot one flow’s cwnd against the other’s. Additive increase moves the point along a 45° line; multiplicative decrease moves it along a line through the origin. The intersection of those two motions converges to the fair-share point — equal cwnd for both flows. This is the classical Chiu-Jain result.
flow 2 cwnd ^ | . . . . . fairness line (cwnd1 == cwnd2) | / ↗ | / ↗ | / ↗ ← additive increase: equal slope, parallel to 45° |/ ↘ ← multiplicative decrease: toward origin +----------> flow 1 cwndMultiplicative-increase / additive-decrease would diverge. Additive-increase / multiplicative-decrease converges. That’s the deep reason TCP uses AIMD specifically.
Fast retransmit and fast recovery#
Loss recovery used to be “wait for the RTO, retransmit, restart slow-start.” Fast retransmit (Reno) added: if three duplicate ACKs arrive, retransmit immediately — the receiver got later segments but not this one, so it’s almost certainly lost.
Fast recovery (NewReno, later default) refined the reaction: don’t drop cwnd to 1. Halve ssthresh, set cwnd = ssthresh + 3 to account for the dup-ACKs, retransmit, and stay in congestion avoidance. Far less throughput hit than a full RTO recovery.
Variants#
- TCP Reno — original AIMD with fast retransmit / fast recovery. Reference implementation; rarely the default today.
- TCP NewReno (RFC 6582) — fixes Reno’s behaviour when multiple segments are lost in one window. Still cumulative-ACK only.
- TCP CUBIC — Linux default since 2.6.19.
cwndgrows as a cubic function of time since the last loss. Plateaus near the previous high-water mark, then probes past it. Better than Reno for high-bandwidth-delay-product paths. - TCP BBR (Bottleneck Bandwidth and Round-trip propagation time) — Google’s loss-independent algorithm. Estimates bottleneck bandwidth (max delivery rate seen) and minimum RTT, sets
cwndto keep the pipe full without overflowing the bottleneck queue. Now widespread on Google services and YouTube. - TCP Vegas — RTT-based. Senses queueing by RTT inflation, not by loss. Falls behind in mixed deployments with loss-based flows (which take more share).
- DCTCP (Data Center TCP) — for datacenter switches with ECN marking. Uses ECN as a signal to ramp
cwnddown proportionally rather than halving. Lower latency in datacenter fabrics. - TCP Westwood / Westwood+ — wireless-friendly; estimates available bandwidth from ACK rate and uses that to set
ssthreshafter loss.
Trade-offs#
Other tensions:
- Aggressiveness vs fairness. A more aggressive algorithm gets more throughput in mixed deployments — at the cost of unfairness. CUBIC was a measured step past Reno; BBR is a step further. Standards bodies care about “TCP-friendly” behaviour.
- Loss-based vs delay-based vs model-based. Loss-based (Reno, CUBIC): packet drops are the signal. Delay-based (Vegas): RTT inflation. Model-based (BBR): direct bandwidth + RTT measurement. Each has different deployment trade-offs.
- Initial cwnd. Larger
IWskips slow start for short responses — wins on HTTP. Too large and a burst from a CDN edge overruns shallow router queues, causing tail drops that hurt page-load time. RFC 6928’s 10 is the current consensus. - Pacing vs bursting. A bursty sender transmits its window worth of segments back-to-back. A pacing sender spreads them over the RTT. Pacing reduces microburst loss; some legacy NICs and middleboxes don’t behave well under it.
Why exactly does multiplicative decrease cause convergence?
When you halve a flow’s cwnd, you halve its share of the bottleneck. Two unequal flows that halve simultaneously get closer in absolute terms — (20, 10) becomes (10, 5), ratio preserved, but absolute gap halved. Repeated halving with shared loss events plus equal additive increase shrinks the gap toward zero. Chiu and Jain’s 1989 paper formalised this with a vector argument: AIMD’s trajectory in (flow1, flow2) space monotonically approaches the fairness line.
Common pitfalls#
- Treating cwnd and rwnd as one thing. They are independent.
min(cwnd, rwnd)is the send window. Confusing them means confusing receiver-driven backpressure with network-driven backoff. - Assuming loss means congestion. On wireless and lossy paths, link-layer noise causes loss without router queue buildup. Loss-based algorithms back off unnecessarily. BBR and ECN address this.
- Ignoring slow start on short connections. A new TCP connection to a remote service may spend its entire life in slow start. The first ten round-trips of a 100ms-RTT path are 1 second of ramp-up. Keep-alive, HTTP/2, and connection pooling exist to amortise this.
- Tuning
tcp_congestion_controlcargo-cult-style. Switching to BBR without measuring can hurt other flows on the path. Profile before changing. - Forgetting that congestion control is per-flow. Two TCP connections from the same client open two
cwndramps — twice the bandwidth share of a single TCP-fair flow. HTTP/1.1’s six parallel connections per origin exploited this; HTTP/2 collapses it back to one (and so reverts to one TCP-fair share). - Not pacing high-bandwidth bursts. A 10 Gbps-capable host bursting its
cwndinstantly can overrun shallow ingress queues at the bottleneck. Pacing (Linuxfqqdisc, BBR’s built-in pacing) smooths this. - Confusing fast retransmit with fast recovery. Fast retransmit is “resend on 3 dup-ACKs.” Fast recovery is “after a fast retransmit, don’t drop
cwndto 1, halve it and continue.” Both are needed; they’re separate refinements.
Related building blocks#