TCP Fundamentals — Header, Handshake, Teardown
Three-way handshake, four-way close, sequence/ack numbers, the flags (SYN/ACK/FIN/RST/PSH/URG), what each enables.
What it is#
TCP (Transmission Control Protocol) is the reliable, in-order, byte-stream transport protocol that has carried most of the Internet for forty years. From the application’s view, it offers a simple abstraction: open a connection to (IP, port), write bytes, read bytes, close. From the network’s view, it does enormous work: sequence numbers and acknowledgements to guarantee delivery, sliding windows for flow control, congestion-control algorithms that prevent collapse, a three-way handshake to establish state, and a four-way teardown to close cleanly.
TCP sits above IP (which provides best-effort datagram delivery) and below the application layer. It multiplexes traffic between processes using 16-bit port numbers. A TCP connection is identified globally by the 5-tuple (src IP, src port, dst IP, dst port, protocol=TCP). Two parallel connections from the same browser to the same server have different source ports — that’s how the OS demultiplexes return traffic.
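As a small illustration of the demultiplexing point, the sketch below opens two connections from one process to the same loopback listener and prints each 5-tuple. The helper name `five_tuple` is made up for this example, and the listener is a throwaway local socket rather than a real server.

```python
import socket

# Throwaway listener on loopback; port 0 asks the OS to pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(2)
server_addr = server.getsockname()

# Two parallel connections from the same process to the same (IP, port).
client_a = socket.create_connection(server_addr)
client_b = socket.create_connection(server_addr)

def five_tuple(sock):
    """Return (src IP, src port, dst IP, dst port, protocol) for a connected socket."""
    src_ip, src_port = sock.getsockname()
    dst_ip, dst_port = sock.getpeername()
    return (src_ip, src_port, dst_ip, dst_port, "TCP")

tuple_a = five_tuple(client_a)
tuple_b = five_tuple(client_b)
print(tuple_a)
print(tuple_b)

for s in (client_a, client_b, server):
    s.close()
```

The two tuples share the destination IP and port and differ only in the source port the OS assigned.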
This writeup covers the surface that’s table stakes in interviews: the header layout, the three-way handshake, the four-way close, and what each flag does. Flow control, congestion control, and reliable-data-transfer mechanics have their own writeups.
When to use it#
Use TCP when you need any of:
- Reliable, in-order delivery. Files, mail, web pages, database wire protocols, RPC frameworks (gRPC).
- Stream semantics. You write a stream of bytes and the receiver reads the same stream — no datagram boundaries to manage.
- Congestion-friendliness without writing it yourself. TCP backs off automatically when the network is congested. UDP-based protocols must implement their own.
- NAT-friendliness. Stateful firewalls and NATs are tuned for TCP — connection tracking is straightforward.
Use UDP (or QUIC) instead when you need any of:
- Low connection-setup cost. TCP costs at least 1 RTT to open; UDP costs none. TLS 1.3 adds another RTT on top of TCP (TLS 1.2 adds two) unless 0-RTT resumption is used.
- Independent message delivery. DNS queries, NTP, real-time audio/video where a late packet is worse than a lost one.
- Multicast or broadcast. TCP can’t do either by definition (it’s connection-oriented).
- Per-message ordering control. TCP forces in-order delivery — one lost packet blocks every later byte on the stream until retransmit.
How it works#
The TCP header#
20 bytes minimum, up to 60 with options.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgement Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |        Urgent Pointer         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options (0 to 40 bytes)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields that matter most:
- Source / destination port — 16 bits each. Combined with IPs they identify the connection.
- Sequence number — 32 bits. Position in the byte stream of the first data byte in this segment, offset by the connection’s initial sequence number (ISN), which is randomised per connection for security. The value wraps modulo 2^32.
- Acknowledgement number — 32 bits. The next sequence number this side expects to receive. Cumulative: ack N means “I’ve received everything up to byte N-1”.
- Data offset — 4 bits. Header length in 32-bit words; up to 15 (so 60 bytes including options).
- Flags — six one-bit fields: URG, ACK, PSH, RST, SYN, FIN.
- Window — 16 bits. Advertised receive-window size in bytes (the receiver’s buffer space). Scaled by the window-scale option to break the 64KB ceiling.
- Checksum — 16 bits over header + data + a pseudo-header from IP. Detects bit errors.
- Options — MSS (max segment size, negotiated at handshake), window scale, SACK permitted, timestamps.
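To make the layout concrete, here is a minimal sketch that decodes the fixed 20-byte header with Python's `struct` module. The function name `parse_tcp_header` and the sample segment (ports, sequence number, window) are invented for illustration; the sketch does not verify the checksum or interpret individual options.

```python
import struct

def parse_tcp_header(segment):
    """Decode the TCP header at the start of `segment` (raw bytes)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    # Network byte order: two 16-bit ports, two 32-bit numbers, four 16-bit fields.
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    header_len = (offset_flags >> 12) * 4    # data offset counts 32-bit words
    if not 20 <= header_len <= len(segment):
        raise ValueError("data offset is out of range")
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": header_len,
        "flags": offset_flags & 0x3F,        # low six bits: URG ACK PSH RST SYN FIN
        "window": window,
        "checksum": checksum,
        "urgent": urgent,
        "options": segment[20:header_len],
        "payload": segment[header_len:],
    }

# Hypothetical SYN segment: every value here is a made-up sample.
sample = struct.pack("!HHIIHHHH", 49152, 443, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
header = parse_tcp_header(sample)
print(header["src_port"], header["dst_port"], header["header_len"], hex(header["flags"]))
```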
What each flag means#
- SYN — “synchronize sequence numbers”. Set only on the first two segments of a connection (the SYN and the SYN-ACK).
- ACK — “this segment carries a valid acknowledgement number”. Set on every segment after the first SYN.
- FIN — “I have no more data to send”. Each direction is closed independently.
- RST — “reset; this connection is aborted”. Sent when something is so wrong recovery isn’t worth it (port closed, protocol violation, application kill).
- PSH — “push this data up to the application immediately, don’t buffer”. Mostly a hint these days.
- URG — “the urgent pointer field is valid”. Almost never used in modern applications.
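The six flags share one byte of the header, so decoding them is a matter of bit masks. The sketch below uses the bit positions from the header diagram (FIN is the lowest bit, URG the highest of the six); the names `FLAG_BITS` and `flag_names` are illustrative only.

```python
# Bit positions of the six classic flags, lowest bit first.
FLAG_BITS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}

def flag_names(flags):
    """Return the names of the flags set in a TCP flags byte."""
    return [name for name, bit in FLAG_BITS.items() if flags & bit]

syn_ack = flag_names(0x12)   # second segment of the handshake
fin_ack = flag_names(0x11)   # typical teardown segment
data = flag_names(0x18)      # typical data segment with PSH
print(syn_ack, fin_ack, data)
```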
Three-way handshake#
client                                          server
  | -- SYN, seq=X, mss, wscale ---------------> |
  | <- SYN, ACK, seq=Y, ack=X+1, mss, wscale -- |
  | -- ACK, seq=X+1, ack=Y+1 -----------------> |
  (connection established; data flows)

Step 1: client picks an initial sequence number X (randomised) and sends SYN seq=X.
Step 2: server picks its own initial sequence Y and replies SYN-ACK seq=Y ack=X+1.
Step 3: client acks the server’s SYN with ACK ack=Y+1.
Both sides now know the other’s starting sequence and can compute future acks.
The handshake also carries options: MSS (each side announces the largest segment it will accept, usually derived from its link MTU, so the peer never sends anything bigger), window scaling, SACK support, and timestamps for RTT measurement and PAWS (protection against wrapped sequences).
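The sequence arithmetic of the handshake can be sketched in a few lines. The SYN consumes one sequence number, which is why each ack is the peer's initial sequence number plus one, and all arithmetic wraps modulo 2^32. The function name and tuple layout below are assumptions made for this example, not part of any real API.

```python
SEQ_SPACE = 2 ** 32   # sequence numbers are 32-bit and wrap around

def handshake(client_isn, server_isn):
    """Return the three handshake segments as (sender, flags, seq, ack) tuples."""
    client_next = (client_isn + 1) % SEQ_SPACE   # the SYN uses up one number
    server_next = (server_isn + 1) % SEQ_SPACE
    return [
        ("client", "SYN", client_isn, None),
        ("server", "SYN,ACK", server_isn, client_next),
        ("client", "ACK", client_next, server_next),
    ]

# A client ISN at the top of the range shows the wrap-around.
segments = handshake(client_isn=0xFFFFFFFF, server_isn=5000)
for segment in segments:
    print(segment)
```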
Four-way teardown#
client                                        server
  | -- FIN, ACK ----------------------------> |
  | <- ACK ---------------------------------- |
  |        (server can still send data here)  |
  | <- FIN, ACK ----------------------------- |
  | -- ACK ---------------------------------> |
  (connection closed; client enters TIME_WAIT)

Each direction closes independently: one side sends FIN, the other acks, then eventually sends its own FIN, which the first side acks. The active closer (typically the client) enters TIME_WAIT for 2 * MSL (Maximum Segment Lifetime). RFC 793 suggests an MSL of 2 minutes; Linux instead fixes the whole TIME_WAIT period at 60 seconds. This guards against late segments from the old connection being delivered to a new connection that happens to reuse the same 5-tuple.
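The independent close of each direction is visible from the sockets API as a half-close. In the sketch below, which runs a throwaway server thread on loopback, the client calls `shutdown(SHUT_WR)` to send its FIN and then keeps reading; the server sees end-of-stream, and only then sends its reply. The function name `serve_once` and the message contents are made up for the example.

```python
import socket
import threading

def serve_once(listener, received_log):
    """Accept one connection, read until the client's FIN, then reply."""
    conn, _ = listener.accept()
    with conn:
        received = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:              # empty read: the client's FIN has arrived
                break
            received += chunk
        # The client-to-server direction is closed; server-to-client still works.
        conn.sendall(b"got %d bytes" % len(received))
    received_log.append(received)      # leaving the with-block sends the server's FIN

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
received_log = []
worker = threading.Thread(target=serve_once, args=(listener, received_log))
worker.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"hello")
client.shutdown(socket.SHUT_WR)        # send FIN, keep the read side open
reply = b""
while True:
    chunk = client.recv(4096)
    if not chunk:                      # the server's FIN
        break
    reply += chunk
client.close()
worker.join()
listener.close()
print(reply)
```

Because the client sent the first FIN, it is the active closer here and its side of the connection is the one that passes through TIME_WAIT.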
Connection states#
The classic TCP state machine has 11 states. The ones you’ll see in ss -tan or netstat:
- LISTEN — server waiting for SYNs.
- SYN_SENT — client sent SYN, waiting for SYN-ACK.
- SYN_RECV — server got SYN, sent SYN-ACK, waiting for ACK. Half-open.
- ESTABLISHED — handshake complete; data flows.
- FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, LAST_ACK — various points of the teardown.
- TIME_WAIT — active closer’s final wait.
- CLOSED — gone.
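One way to see how the states connect is a transition table. The sketch below is a deliberately simplified subset of the state machine: it leaves out RST handling, simultaneous open, and timeouts other than the TIME_WAIT expiry, and the event names are invented labels for this example.

```python
# (current state, event) -> next state. Simplified subset of the TCP state machine.
TRANSITIONS = {
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open"): "SYN_SENT",         # sends SYN
    ("LISTEN", "recv_syn"): "SYN_RECV",            # sends SYN-ACK
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED",   # sends ACK
    ("SYN_RECV", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",        # active close, sends FIN
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",     # passive close, sends ACK
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",         # both sides closed at once
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",       # sends ACK
    ("CLOSING", "recv_ack"): "TIME_WAIT",
    ("CLOSE_WAIT", "close"): "LAST_ACK",           # sends FIN
    ("LAST_ACK", "recv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",
}

def walk(start, events):
    """Follow a list of events from `start` and return every state visited."""
    state = start
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path

active_closer = walk("CLOSED", ["active_open", "recv_syn_ack", "close",
                                "recv_ack", "recv_fin", "timeout_2msl"])
passive_closer = walk("CLOSED", ["passive_open", "recv_syn", "recv_ack",
                                 "recv_fin", "close", "recv_ack"])
print(" -> ".join(active_closer))
print(" -> ".join(passive_closer))
```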
Variants#
- TCP with selective acknowledgement (SACK) — receiver acks non-contiguous ranges ("got 0-100 and 200-300, missing 100-200"), letting sender retransmit only the gaps. Enabled by default everywhere.
- TCP Fast Open (TFO) — carries data in the SYN itself using a cookie from a prior connection. Saves an RTT for repeat clients. Adoption is patchy due to middlebox interference.
- TCP keepalives — periodic probes to detect dead peers. Off by default; applications enable per-socket. Default intervals are absurdly long (2 hours on Linux) — tune them.
- Multipath TCP (MPTCP) — splits one TCP connection across multiple paths (Wi-Fi + cellular). Apple has used it on iOS for Siri, among other services.
- TCP BBR — Google’s bottleneck-bandwidth congestion-control algorithm. Doesn’t treat loss as the only congestion signal — measures bandwidth and RTT directly. Now a default on YouTube and parts of Google’s edge.
- QUIC — strictly speaking not a TCP variant, but the practical successor. UDP-based, integrates TLS 1.3, per-stream loss recovery. Carries HTTP/3.
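For the keepalive variant above, per-socket tuning might look like the sketch below. `SO_KEEPALIVE` is portable, but the `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are platform-specific, so the code only sets the ones the local Python build exposes. The function name and the 60 s / 10 s / 5 probe values are arbitrary illustrations, not recommendations.

```python
import socket

def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalives and, where supported, shorten the default timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)   # missing on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
sock.close()
print("keepalive enabled:", bool(keepalive_on))
```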
Trade-offs#
Other tensions inside TCP itself:
- Reliability cost. Every byte must be ack’d; lost segments must be retransmitted; the OS must buffer in-flight data on both sides. Memory and bandwidth aren’t free: tcp_rmem/tcp_wmem tuning is real work at scale.
- In-order delivery. A single lost packet blocks every later byte on the stream until retransmission completes. Painful for low-latency streaming inside one TCP connection. HTTP/2 over TCP inherits this; HTTP/3 over QUIC dodges it.
- Slow start. Each new TCP connection starts conservatively and ramps up. For short-lived connections (one HTTP/1.1 request), slow start dominates the timing. HTTP keep-alive (reusing one connection for many requests) and HTTP/2 multiplexing both attack this.
- Random initial sequence numbers. Prevent off-path attackers from injecting segments. But middleboxes that rewrite sequence numbers (some firewalls and NATs) can break SACK when they fail to rewrite the SACK blocks in the options to match.
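The slow-start cost can be estimated with simple arithmetic. The sketch below is an idealised model under stated assumptions: an initial window of 10 segments, an MSS of 1460 bytes, a congestion window that doubles every round trip, no loss, and a receive window that never limits the sender. It counts only data round trips, not the handshake. The function name is made up for the example.

```python
def round_trips_to_send(total_bytes, mss=1460, initial_window=10):
    """Count the round trips needed to send `total_bytes` under idealised slow start."""
    remaining = -(-total_bytes // mss)   # ceiling division: segments left to send
    cwnd = initial_window                # congestion window, in segments
    rtts = 0
    while remaining > 0:
        remaining -= cwnd                # one window's worth goes out per round trip
        cwnd *= 2                        # slow start doubles the window each round trip
        rtts += 1
    return rtts

for size in (14_600, 100_000, 1_000_000):
    print(size, "bytes:", round_trips_to_send(size), "round trips")
```

Under these assumptions a 100 KB response needs three round trips of data transfer on a fresh connection, which is why connection reuse matters for short transfers.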
Why the handshake is three-way and not two
Two messages (client SYN, server SYN-ACK) only confirm one direction. The third message (client ACK) is what tells the server “I received your SYN-ACK; we agree on starting sequences both ways”. Without it, a stale duplicate SYN from an old connection could trick the server into committing to a session no client is actually opening. The third ACK is the cheap insurance against that.
Common pitfalls#
- Treating TCP as message-oriented. It is a byte stream. Two write() calls of 100 bytes each may arrive as one 200-byte read, two 100-byte reads, or 50+150. You must frame messages yourself (length prefix, delimiter, or fixed-size).
- Assuming close() flushes. close() returns once the data is queued in the kernel, not once the peer has received it. If the peer never reads, the queued data and the FIN behind it can sit unsent, and the socket lingers in FIN_WAIT_1 (or LAST_ACK if the peer closed first) while the buffers stay full. SO_LINGER controls this; defaults are usually fine.
- Confusing flow control with congestion control. Flow control is the receive window (rwnd) protecting the receiver. Congestion control is cwnd protecting the network. They’re independent mechanisms; the sender uses min(rwnd, cwnd).
- Ignoring Nagle and delayed-ACK interaction. Nagle batches small sends; delayed-ACK batches small acks. Together they can add around 40 ms of latency to interactive traffic on Linux, and more on stacks with longer delayed-ACK timers. TCP_NODELAY disables Nagle; RPC frameworks and SSH typically set it.
- Forgetting that RST is asynchronous. A peer that crashes or kills the connection typically ends up sending RST; the local side may not learn until the next read or write returns ECONNRESET. Connection-pool health checks help.
- Tuning kernel knobs cargo-cult-style. net.core.somaxconn, tcp_max_syn_backlog, tcp_fin_timeout, tcp_keepalive_time: measure before changing. Defaults are reasonable on modern Linux.
- Believing the RTT estimator immediately. Karn’s algorithm ignores RTT samples from retransmitted segments. Early in a connection, the smoothed RTT is noisy, so initial retransmission timeouts may fire too eagerly.
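For the first pitfall, a minimal length-prefix framing sketch is shown below. Each message is preceded by a 4-byte big-endian length, and the reader loops until it has exactly that many bytes. The function names are invented for the example, and `socket.socketpair()` stands in for a real TCP connection (it is a local stream socket with the same byte-stream semantics).

```python
import socket
import struct

def send_message(sock, payload):
    """Send one message as a 4-byte big-endian length followed by the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the stream mid-message")
        buf.extend(chunk)
    return bytes(buf)

def recv_message(sock):
    """Read one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

left, right = socket.socketpair()
send_message(left, b"first")
send_message(left, b"second message")
# Both messages may already sit back to back in the receive buffer;
# the length prefix is what separates them again.
messages = [recv_message(right), recv_message(right)]
left.close()
right.close()
print(messages)
```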
Related building blocks#