TCP Flow Control and Window Scaling

The receive window, advertised window, zero-window probes, and the window-scaling option that fixed the 64KB limit.

Building Block Intermediate
8 min read
tcp flow-control window-scaling receive-window transport-layer

What it is#

TCP flow control is the receiver’s mechanism for telling the sender “slow down, you are filling my buffer faster than my application can drain it.” It is a strictly end-to-end concern — the receiver protecting itself — and is independent of congestion control, which protects the network. Both end up clamping the sender, but for different reasons; the sender obeys whichever is tighter.

The mechanism is dead simple. Every TCP segment carries a 16-bit window field — the number of bytes the receiver is willing to accept beyond the byte it just acknowledged. The sender keeps bytes_in_flight <= rwnd (the receive window) at all times. When the receiver’s buffer fills, it advertises rwnd = 0; the sender stops sending and waits.

The window-scaling option (RFC 7323) multiplies the advertised value by a power of 2 negotiated at handshake, allowing windows well past the 64 KB ceiling of a 16-bit field. Without it, modern bandwidth-delay products are unreachable.

When to use it#

Flow control is always on; you don’t opt in. What you do tune:

  • Per-socket buffer sizes (SO_RCVBUF, SO_SNDBUF). Setting these implicitly caps the receive window. By default Linux auto-tunes both, and setting SO_RCVBUF explicitly turns receive auto-tuning off for that socket; manual tuning helps mainly on long-fat networks (high bandwidth, high RTT).
  • Window scaling. Enabled by default on every modern stack. It ends up disabled only when a very old peer omits it or a broken middlebox strips the option from the SYN.
  • Application-side recv cadence. A slow recv loop is what triggers flow control in the first place. If your service is back-pressured by downstream calls and falls behind on its TCP read, the kernel will advertise smaller windows — visible upstream as throughput collapsing.
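As a rough illustration of the first knob, the sketch below requests a receive buffer size on a fresh TCP socket and reads back what the kernel actually granted. The helper name make_socket_with_rcvbuf is hypothetical; the socket calls are standard library. On Linux the reported value is usually double the request (the kernel reserves room for bookkeeping) and is capped by net.core.rmem_max, so treat the number as system-dependent.

```python
import socket


def make_socket_with_rcvbuf(requested_bytes: int) -> tuple[socket.socket, int]:
    """Create a TCP socket, request a receive buffer size, return (socket, effective size)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the size before connect()/listen() so it can influence the
    # window scale chosen during the handshake.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, effective


sock, effective = make_socket_with_rcvbuf(256 * 1024)
print(f"requested {256 * 1024} bytes, kernel reports {effective}")
sock.close()
```

Measure before keeping a manual size: a fixed buffer gives up auto-tuning for that socket.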

Reach for flow-control debugging when:

  • A bulk transfer plateaus far below the link’s capacity. Probably bandwidth-delay product exceeding the negotiated window. Check ss -tani and look at rcv_wnd and snd_wnd.
  • You see ZeroWindow segments in tcpdump. Receiver is buffer-bound. Find out why the application isn’t draining.
  • A workload deadlocks waiting on TCP. Sender and receiver may both be in ESTABLISHED but neither is making progress — a classic “receiver is full and the application isn’t reading” symptom.

How it works#

The receive window#

Each TCP segment’s header has a 16-bit Window field. It means: “I (the sender of this segment) can receive Window more bytes beyond the byte my ACK acknowledges.” The receiver computes it from its socket-buffer free space:

rwnd = receive_buffer_size - bytes_buffered_not_yet_read

The sender keeps:

bytes_in_flight = bytes_sent - bytes_acked
allowed_in_flight = min(rwnd, cwnd) # cwnd is the congestion window

and stops sending when bytes_in_flight reaches allowed_in_flight. The window advances every time the receiver’s application reads from the socket, freeing buffer space.
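The bookkeeping above can be sketched as a toy model. This is a simplified illustration, not how any kernel implements it: the Receiver class and the sendable function are hypothetical names, and ACK timing, segmentation, and loss are ignored.

```python
from dataclasses import dataclass


@dataclass
class Receiver:
    buffer_size: int
    buffered: int = 0  # bytes received but not yet read by the application

    def rwnd(self) -> int:
        # rwnd = receive_buffer_size - bytes_buffered_not_yet_read
        return self.buffer_size - self.buffered

    def deliver(self, n: int) -> None:
        # Data arriving from the network consumes buffer space.
        if n > self.rwnd():
            raise ValueError("sender exceeded the advertised window")
        self.buffered += n

    def app_read(self, n: int) -> int:
        # The application draining the socket frees space and reopens the window.
        taken = min(n, self.buffered)
        self.buffered -= taken
        return taken


def sendable(bytes_sent: int, bytes_acked: int, rwnd: int, cwnd: int) -> int:
    """How many more bytes the sender may put on the wire right now."""
    bytes_in_flight = bytes_sent - bytes_acked
    allowed_in_flight = min(rwnd, cwnd)
    return max(0, allowed_in_flight - bytes_in_flight)


rx = Receiver(buffer_size=4096)
print(sendable(0, 0, rx.rwnd(), cwnd=10_000))        # limited by rwnd
rx.deliver(4096)                                     # buffer is now full
print(sendable(4096, 4096, rx.rwnd(), cwnd=10_000))  # window closed
rx.app_read(1500)                                    # application drains some data
print(sendable(4096, 4096, rx.rwnd(), cwnd=10_000))  # window reopened
```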

Window scaling#

The 16-bit field caps at 65,535 bytes. On a 1 Gbps link with 100 ms RTT, the bandwidth-delay product is about 12.5 MB, roughly 190 times larger. Without help, TCP’s throughput would max out at 65,535 bytes / 0.1 s ≈ 655 KB/s (about 5 Mbps) per connection, less than 1% of the link.
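The arithmetic can be checked with a few lines. The function names are hypothetical, and the model assumes the sender can keep at most one window of data in flight per round trip.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes needed in flight to keep the path full."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput(window_bytes: int, rtt_s: float) -> float:
    """Upper bound in bytes per second when one window is sent per RTT."""
    return window_bytes / rtt_s


link_bps = 1e9   # 1 Gbps
rtt_s = 0.1      # 100 ms

bdp = bdp_bytes(link_bps, rtt_s)
cap = window_limited_throughput(65_535, rtt_s)
fraction = cap * 8 / link_bps

print(f"BDP: {bdp / 1e6:.1f} MB")
print(f"Unscaled window cap: {cap / 1e3:.0f} KB/s")
print(f"Share of link used: {fraction:.2%}")
```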

The window-scale option is sent in the SYN and SYN-ACK only, and takes effect only if both sides include it. It carries a single byte S between 0 and 14. Each side announces its own S, and the peer multiplies the 16-bit window field in that side’s later segments by 2^S to get the real window. With S = 14, the effective window can reach 1 GB. The option is negotiated once and applies for the connection’s lifetime, including on segments that don’t carry the option; the window field in the SYN and SYN-ACK themselves is never scaled.
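A small sketch of the scaling arithmetic, with hypothetical helper names: effective_window decodes a wire value given a shift count, and min_shift_for finds the smallest shift whose maximum window covers a target size such as a bandwidth-delay product.

```python
MAX_SHIFT = 14  # largest shift count the option allows


def effective_window(raw_field: int, shift: int) -> int:
    """Decode the 16-bit wire value into the real window size in bytes."""
    if not 0 <= raw_field <= 0xFFFF:
        raise ValueError("window field is 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return raw_field << shift


def min_shift_for(window_bytes: int) -> int:
    """Smallest shift whose maximum scaled window is at least window_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if (0xFFFF << shift) >= window_bytes:
            return shift
    raise ValueError("window larger than the maximum scaled window")


print(effective_window(0xFFFF, 14))   # largest window the option can express
print(min_shift_for(12_500_000))      # shift needed for a 12.5 MB window
```

Note that a larger shift also coarsens the granularity: with shift S the window can only be expressed in multiples of 2^S bytes.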

Zero-window probes#

When the receiver’s buffer is full, it advertises rwnd = 0. The sender stops. But if the segment that opens the window (sent after the receiver reads) is lost, both sides could deadlock forever — the sender doesn’t transmit because the last window was 0, the receiver doesn’t repeat because it has nothing new to send.

TCP breaks the deadlock with the persist timer and zero-window probes: the sender periodically sends a 1-byte segment past the window edge. The receiver replies with an ACK that re-advertises the current window. If the window is still 0, the sender retries with exponential backoff (Linux starts around 200 ms and caps at ~120 s).
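To get a feel for the backoff, the sketch below generates a probe schedule using the approximate Linux figures quoted above. It assumes plain doubling with a hard cap; real stacks derive the first interval from the retransmission timeout, so the exact values will differ. The function name is hypothetical.

```python
def persist_intervals(initial_s: float = 0.2, cap_s: float = 120.0, probes: int = 12) -> list[float]:
    """Waits between successive zero-window probes, doubling up to a cap."""
    intervals = []
    delay = initial_s
    for _ in range(probes):
        intervals.append(min(delay, cap_s))
        delay *= 2
    return intervals


schedule = persist_intervals()
print(schedule)
print(f"time until the 12th probe: {sum(schedule):.1f} s")
```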

Silly window syndrome#

If the application reads tiny amounts at a time, the receiver could keep advertising tiny window updates and the sender could keep sending tiny segments. Header overhead dwarfs payload — “silly window syndrome.”

Fixes are on both sides:

  • Receiver-side (Clark’s rule): don’t advertise a window update until either half the buffer is free or one MSS is free.
  • Sender-side (Nagle’s algorithm): don’t send a small segment if there’s already unacked data outstanding — coalesce until the prior ACK comes back or there’s at least one MSS to send.
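Both rules reduce to small predicates. The sketch below is a simplified reading of the two bullets, with hypothetical function names; real implementations track the previously advertised window and other state that is omitted here.

```python
def should_advertise_update(free_bytes: int, buffer_size: int, mss: int) -> bool:
    """Receiver side: hold the update until half the buffer or one MSS is free."""
    return free_bytes >= min(buffer_size // 2, mss)


def nagle_can_send(pending_bytes: int, unacked_bytes: int, mss: int) -> bool:
    """Sender side: send a full segment at once, a small one only when nothing is unacked."""
    if pending_bytes >= mss:
        return True
    return pending_bytes > 0 and unacked_bytes == 0


MSS = 1460
print(should_advertise_update(free_bytes=100, buffer_size=65_536, mss=MSS))   # too small to announce
print(should_advertise_update(free_bytes=1460, buffer_size=65_536, mss=MSS))  # one MSS is free
print(nagle_can_send(pending_bytes=10, unacked_bytes=500, mss=MSS))           # coalesce and wait
print(nagle_can_send(pending_bytes=10, unacked_bytes=0, mss=MSS))             # nothing outstanding
```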

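Nagle interacts badly with TCP delayed-ACK (the receiver delays the ACK up to 40 ms hoping to piggyback data). Together they can stall short interactive exchanges for the length of the delayed-ACK timer rather than a single RTT: the sender holds its next small write until an ACK arrives, while the receiver sits on that ACK until its timer fires. Disable Nagle on RPC and SSH sockets with TCP_NODELAY.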
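Setting the option takes one call. A minimal sketch using the standard socket module; the helper name disable_nagle is hypothetical.

```python
import socket


def disable_nagle(sock: socket.socket) -> bool:
    """Turn on TCP_NODELAY and report whether the kernel accepted it."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print("TCP_NODELAY enabled:", disable_nagle(sock))
sock.close()
```

With Nagle off, the application becomes responsible for batching: many tiny writes turn into many tiny segments.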

Flow control vs congestion control#

sender's allowed window = min(rwnd, cwnd)
                              |     |
                              |     +-- congestion window
                              |         (sender protects the network)
                              +-- receive window
                                  (receiver protects itself)

Same clamp shape, different signals. rwnd shrinks when the receiver’s app is slow. cwnd shrinks when the network drops packets or the sender’s congestion-control algorithm decides the path is congested. Both are recomputed continuously; the smaller one wins.

Variants#

  • Auto-tuning (Linux tcp_rmem). Modern kernels grow the receive buffer dynamically based on bandwidth-delay measurements. net.ipv4.tcp_rmem sets (min, default, max). Most workloads should leave this alone.
  • Window-scale ranges. Negotiated at handshake (0–14). Beware that asymmetric scaling is allowed — each side picks its own S.
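  • DSACK (duplicate-SACK). Receiver tells the sender about duplicates it received. Helps the sender distinguish “I retransmitted unnecessarily” from “the network really lost it.” Its effect is on loss recovery and congestion control rather than on the receive window: a sender that learns a retransmission was spurious can avoid or undo an unnecessary cwnd reduction.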
  • ECN (Explicit Congestion Notification). Not flow control proper, but interacts — receiver echoes ECN bits to the sender, which then shrinks cwnd without dropping a packet. Reduces queueing-induced flow-control oscillation.
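For the auto-tuning entry, the three tcp_rmem values can be read from /proc/sys/net/ipv4/tcp_rmem on Linux as whitespace-separated integers. The sketch below parses that format from a string so it runs anywhere; the TcpRmem type, the parse_tcp_rmem name, and the sample numbers are illustrative assumptions, not recommended settings.

```python
from typing import NamedTuple


class TcpRmem(NamedTuple):
    min_bytes: int
    default_bytes: int
    max_bytes: int


def parse_tcp_rmem(text: str) -> TcpRmem:
    """Parse the (min, default, max) triple from the sysctl's text form."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected three integers, got {len(fields)} fields")
    return TcpRmem(*(int(field) for field in fields))


# Illustrative values only; on Linux the real ones would come from
# open("/proc/sys/net/ipv4/tcp_rmem").read().
sample = "4096\t131072\t6291456\n"
limits = parse_tcp_rmem(sample)
print(f"auto-tuning may grow the receive buffer up to {limits.max_bytes} bytes")
```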

Trade-offs#

Large receive buffers — fill a high-bandwidth-delay-product path, achieve line rate, hide jitter. Cost: more kernel memory per connection (multiplied across thousands of connections); slow receivers buffer more before backpressuring.
Small receive buffers — bounded memory per connection, faster backpressure to slow senders. Cost: cannot fill long-fat networks, throughput plateaus, sensitive to RTT spikes.

Other tensions:

  • Window scaling and middleboxes. If a middlebox along the path doesn’t understand wscale, the connection silently misbehaves. Modern paths are mostly clean; some enterprise networks aren’t.
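  • Flow control vs head-of-line blocking. A TCP connection’s window covers every byte on the connection; TCP has no notion of streams. HTTP/2 multiplexes streams over one TCP connection and layers its own per-stream flow control on top, but a single lost segment still stalls every stream until it is retransmitted. HTTP/3 over QUIC avoids that by giving each stream independent delivery as well as per-stream flow control.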
  • Memory pressure vs throughput. On a high-fanout server with 100k connections, every extra megabyte of receive buffer per connection is 100 GB of RAM. Auto-tuning matters.
  • Persist timer and battery. Mobile clients sitting in zero-window for hours still send probes; on low-power devices this can prevent idle states. Apps usually close idle TCP rather than rely on the persist timer.

Why is the window 16 bits if everyone uses scaling?

TCP shipped in 1981; 64 KB was generous when ARPANET links were measured in kilobits per second. By the 1990s long-fat networks made it inadequate. Window scaling (RFC 1323, 1992, since superseded by RFC 7323) was the additive fix: keep the wire field, multiply by a per-connection shift negotiated at handshake. The original field stays as-is because TCP’s header layout is effectively frozen; widening it would break deployed stacks and middleboxes that parse the header.

Common pitfalls#

  • Conflating flow control with congestion control. They are separate mechanisms with separate signals. Both clamp the sender. Flow control protects the receiver; congestion control protects the network.
  • Tuning SO_RCVBUF without measuring. Default auto-tuning is good. Manual buffer sizing helps only on known long-fat paths.
  • Disabling Nagle blindly. TCP_NODELAY is right for interactive RPCs (bytes per call < MSS, latency matters). For bulk transfer it just costs you packet efficiency.
  • Ignoring zero-window in production. A persistent ZeroWindow run from a backend is the network’s way of saying “your service isn’t reading from this socket.” Find the consumer that’s stalled.
  • Assuming wscale always works. Old hardware load balancers, transparent proxies, and some carrier-grade NATs strip the option. Check ss -tani’s wscale: field if a flow underperforms.
  • Receiver buffering forever. Out-of-order packets fill the receive buffer until the gap is filled. A pathological loss pattern can keep rwnd near zero indefinitely. SACK and fast retransmit are the relief valves.
  • Mistaking head-of-line blocking for flow control. They look similar (sender stalls, receiver doesn’t get data) but the cause differs — flow control is a window-clamp problem; HOL is an in-order-delivery problem inside an already-windowed connection.
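For the wscale check mentioned in the pitfalls above, a short script can pull the two shift counts out of a line of ss -tani output. The sample line is illustrative rather than captured from a real host, the parse_wscale name is hypothetical, and the exact layout of ss output varies between iproute2 versions, so treat the pattern as an assumption to verify locally.

```python
import re
from typing import Optional

# Matches the "wscale:<n>,<n>" token that ss prints when scaling was negotiated.
WSCALE_RE = re.compile(r"\bwscale:(\d+),(\d+)")


def parse_wscale(ss_info_line: str) -> Optional[tuple[int, int]]:
    """Return the two window-scale shift counts, or None if the token is absent."""
    match = WSCALE_RE.search(ss_info_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Illustrative line, not real output.
sample = "cubic wscale:7,7 rto:204 rtt:0.5/0.25 mss:1448 cwnd:10"
print(parse_wscale(sample))
print(parse_wscale("cubic rto:204 mss:1448"))  # no scaling token on this line
```

A missing token on a long-lived flow is a hint that scaling was not negotiated, which is worth confirming with a packet capture of the handshake.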