TCP Fundamentals — Header, Handshake, Teardown

Three-way handshake, four-way close, sequence/ack numbers, the flags (SYN/ACK/FIN/RST/PSH/URG), what each enables.

What it is#

TCP (Transmission Control Protocol) is the reliable, in-order, byte-stream transport protocol that has carried most of the Internet for forty years. From the application’s view, it offers a simple abstraction: open a connection to (IP, port), write bytes, read bytes, close. From the network’s view, it does enormous work: sequence numbers and acknowledgements to guarantee delivery, sliding windows for flow control, congestion-control algorithms that prevent collapse, a three-way handshake to establish state, and a four-way teardown to close cleanly.

TCP sits above IP (which provides best-effort datagram delivery) and below the application layer. It multiplexes traffic between processes using 16-bit port numbers. A TCP connection is identified globally by the 5-tuple (src IP, src port, dst IP, dst port, protocol=TCP). Two parallel connections from the same browser to the same server have different source ports — that’s how the OS demultiplexes return traffic.
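
The demultiplexing point can be checked on one machine. The sketch below is a minimal illustration, assuming a loopback listener stands in for the server; the variable names are made up for the example. It opens two parallel connections and compares their address tuples.

```python
import socket

# Listen on an OS-chosen loopback port (a stand-in for "the server").
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(2)
server_addr = server.getsockname()

# Two parallel connections from the same client host to the same server.
c1 = socket.create_connection(server_addr)
c2 = socket.create_connection(server_addr)

# Each tuple is (src IP, src port, dst IP, dst port). Only the source port
# differs, and that is what lets the OS route return traffic to the right socket.
tuple1 = (*c1.getsockname(), *c1.getpeername())
tuple2 = (*c2.getsockname(), *c2.getpeername())
print(tuple1)
print(tuple2)

for s in (c1, c2, server):
    s.close()
```

The source ports are ephemeral ports picked by the kernel, so the exact numbers vary from run to run.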

This writeup covers the surface that’s table stakes in interviews: the header layout, the three-way handshake, the four-way close, and what each flag does. Flow control, congestion control, and reliable-data-transfer mechanics have their own writeups.

When to use it#

Use TCP when you need any of:

  • Reliable, in-order delivery. Files, mail, web pages, database wire protocols, RPC frameworks (gRPC).
  • Stream semantics. You write a stream of bytes and the receiver reads the same stream — no datagram boundaries to manage.
  • Congestion-friendliness without writing it yourself. TCP backs off automatically when the network is congested. UDP-based protocols must implement their own.
  • NAT-friendliness. Stateful firewalls and NATs are tuned for TCP — connection tracking is straightforward.

Use UDP (or QUIC) instead when you need any of:

  • Low connection-setup cost. TCP costs one RTT to open before any data can flow; UDP costs none. TLS then adds one more RTT with TLS 1.3 (two with TLS 1.2), unless a resumed TLS 1.3 session sends 0-RTT early data.
  • Independent message delivery. DNS queries, NTP, real-time audio/video where a late packet is worse than a lost one.
  • Multicast or broadcast. TCP can’t do either by definition (it’s connection-oriented).
  • Per-message ordering control. TCP forces in-order delivery — one lost packet blocks every later byte on the stream until retransmit.

How it works#

The TCP header#

20 bytes minimum, up to 60 with options.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgement Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |        Urgent Pointer         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options (0 to 40 bytes)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields that matter most:

  • Source / destination port — 16 bits each. Combined with IPs they identify the connection.
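  • Sequence number — 32 bits. The sequence number of the first data byte in this segment. Numbering starts at the connection’s initial sequence number (ISN), which is randomised per connection for security, and wraps modulo 2^32. The value on the wire is absolute; tools such as Wireshark display it relative to the ISN by default.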
  • Acknowledgement number — 32 bits. The next sequence number this side expects to receive. Cumulative — ack N means “I’ve received everything up to byte N-1”.
  • Data offset — 4 bits. Header length in 32-bit words; up to 15 (so 60 bytes including options).
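  • Flags — six one-bit fields in the original layout: URG, ACK, PSH, RST, SYN, FIN. RFC 3168 later took two of the reserved bits for explicit congestion notification (CWR and ECE).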
  • Window — 16 bits. Advertised receive-window size in bytes (the receiver’s buffer space). Scaled by the window-scale option to break the 64KB ceiling.
  • Checksum — 16 bits over header + data + a pseudo-header from IP. Detects bit errors.
  • Options — MSS (max segment size, negotiated at handshake), window scale, SACK permitted, timestamps.
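
As a rough illustration of the layout above, the fixed 20-byte part of the header can be decoded with Python's struct module. This is a sketch, not a full parser: the function name and dictionary keys are invented for the example, options are returned as raw bytes, and the checksum is not verified.

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (network byte order)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    header_len = (offset_flags >> 12) * 4   # data offset counts 32-bit words
    if header_len < 20 or header_len > len(segment):
        raise ValueError("data offset is out of range")
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": header_len,
        "flags": offset_flags & 0x3F,       # URG ACK PSH RST SYN FIN, high bit to low
        "window": window,
        "checksum": checksum,
        "urgent_ptr": urgent,
        "options": segment[20:header_len],  # raw option bytes, not decoded here
        "payload": segment[header_len:],
    }

# Hand-built example: a bare SYN from port 50000 to port 443, no options.
syn = struct.pack("!HHIIHHHH", 50000, 443, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
hdr = parse_tcp_header(syn)
print(hdr["src_port"], hdr["dst_port"], hdr["header_len"], hex(hdr["flags"]))
```

The window value returned here is the raw 16-bit field; when window scaling was agreed at the handshake, the effective window is that value shifted left by the negotiated scale.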

What each flag means#

  • SYN — “synchronize sequence numbers”. Set only on the first two segments of a connection (the SYN and the SYN-ACK).
  • ACK — “this segment carries a valid acknowledgement number”. Set on every segment after the first SYN.
  • FIN — “I have no more data to send”. Each direction is closed independently.
  • RST — “reset; this connection is aborted”. Sent when something is so wrong recovery isn’t worth it (port closed, protocol violation, application kill).
  • PSH — “push this data up to the application immediately, don’t buffer”. Mostly a hint these days.
  • URG — “the urgent pointer field is valid”. Almost never used in modern applications.
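
The six flags occupy the low six bits of the flags field, with FIN as the lowest bit. A small decoding helper (the names are hypothetical, chosen for this sketch) makes common combinations easy to read:

```python
# Bit values of the six classic flags within the flags field.
FLAG_BITS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}

def flag_names(flags: int) -> list[str]:
    """Return the names of the flags set in `flags`, lowest bit first."""
    return [name for name, bit in FLAG_BITS.items() if flags & bit]

print(flag_names(0x02))  # the opening SYN
print(flag_names(0x12))  # the SYN-ACK reply
print(flag_names(0x11))  # a FIN that also carries an acknowledgement
```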

Three-way handshake#

client                                   server
| -- SYN, seq=X, mss, wscale ---------------> |
| <- SYN, ACK, seq=Y, ack=X+1, mss, wscale -- |
| -- ACK, seq=X+1, ack=Y+1 -----------------> |
(connection established; data flows)

Step 1: client picks an initial sequence number X (randomised) and sends SYN seq=X. Step 2: server picks its own initial sequence Y and replies SYN-ACK seq=Y ack=X+1. Step 3: client acks the server’s SYN with ACK ack=Y+1. Both sides now know the other’s starting sequence and can compute future acks.

The handshake also exchanges options: MSS (each side announces the largest segment it is willing to receive, usually derived from its link MTU, so the peer avoids segments that would need IP fragmentation), window scaling, SACK support, and timestamps for RTT measurement and PAWS (Protection Against Wrapped Sequences).
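
The sequence and acknowledgement arithmetic of the three steps can be written down directly. The sketch below only models the numbers, not real packets; the function name and the dictionary shape are assumptions made for illustration. It deliberately uses a client ISN at the top of the 32-bit range to show the wrap-around.

```python
MOD = 2**32  # sequence numbers are 32-bit and wrap around

def three_way_handshake(client_isn: int, server_isn: int) -> list[dict]:
    """Return the three handshake segments as dicts of flags, seq, and ack."""
    syn = {"flags": "SYN", "seq": client_isn, "ack": None}
    # The SYN consumes one sequence number, so the server acks client_isn + 1.
    syn_ack = {"flags": "SYN,ACK", "seq": server_isn, "ack": (client_isn + 1) % MOD}
    # The client continues at X+1 and acks the server's SYN with Y+1.
    ack = {"flags": "ACK", "seq": (client_isn + 1) % MOD, "ack": (server_isn + 1) % MOD}
    return [syn, syn_ack, ack]

segments = three_way_handshake(client_isn=4_294_967_295, server_isn=7000)
for segment in segments:
    print(segment)
```

With X = 2^32 - 1, the server's acknowledgement number is 0 rather than 2^32, which is why all sequence comparisons in TCP are done modulo 2^32.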

Four-way teardown#

client                                 server
| -- FIN, ACK ----------------------------> |
| <- ACK --------------------------------- | (server can still send data here)
| <- FIN, ACK ---------------------------- |
| -- ACK ---------------------------------> |
(connection closed; client enters TIME_WAIT)

Each direction closes independently: one side sends FIN, the other acks and later sends its own FIN, which the first side acks. The active closer (the client in the diagram above) then stays in TIME_WAIT for 2 * MSL (Maximum Segment Lifetime). RFC 9293 takes MSL to be 2 minutes; Linux instead uses a fixed 60-second TIME_WAIT. The wait serves two purposes: it lets the active closer re-send the final ACK if the peer retransmits its FIN, and it guards against late segments from the old connection being delivered to a new connection that happens to reuse the same 5-tuple.
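
The independent close of each direction is visible from the sockets API as a half-close. The following sketch assumes a loopback listener and a helper thread playing the server (both invented for the example): the client sends its FIN with shutdown(SHUT_WR) and still receives the server's reply afterwards.

```python
import socket
import threading

def serve_once(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    with conn:
        request = b""
        # recv() returning b"" means the client's FIN arrived: its direction is closed.
        while chunk := conn.recv(1024):
            request += chunk
        # The server's direction is still open, so it can reply after seeing the FIN.
        conn.sendall(b"got " + request)
    # Leaving the with-block closes the socket, which sends the server's FIN.

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
worker = threading.Thread(target=serve_once, args=(listener,))
worker.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"hello")
client.shutdown(socket.SHUT_WR)      # send FIN: "I have no more data to send"

reply = b""
while chunk := client.recv(1024):    # still readable until the server's FIN arrives
    reply += chunk
client.close()
worker.join()
listener.close()
print(reply)
```

Because the client sent the first FIN, it is the active closer here and its side of the connection is the one that passes through TIME_WAIT.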

Connection states#

The classic TCP state machine has 11 states. The ones you’ll see in ss -tan or netstat (ss writes them with hyphens and shortens ESTABLISHED to ESTAB):

  • LISTEN — server waiting for SYNs.
  • SYN_SENT — client sent SYN, waiting for SYN-ACK.
  • SYN_RECV — server got SYN, sent SYN-ACK, waiting for ACK. Half-open.
  • ESTABLISHED — handshake complete; data flows.
  • FIN_WAIT_1, FIN_WAIT_2, CLOSE_WAIT, CLOSING, LAST_ACK — various points of the teardown.
  • TIME_WAIT — active closer’s final wait.
  • CLOSED — gone.
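
One way to internalise the states is as a transition table. The sketch below is a simplified subset of the standard state machine: the event names are hypothetical labels chosen for this example, and resets, timeouts during the handshake, and simultaneous open are left out.

```python
# (current state, event) -> next state. Simplified: no RST, no simultaneous open.
TRANSITIONS = {
    ("CLOSED", "active_open"): "SYN_SENT",       # client sends SYN
    ("CLOSED", "passive_open"): "LISTEN",        # server calls listen()
    ("LISTEN", "recv_syn"): "SYN_RECV",          # server replies SYN-ACK
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED", # client replies ACK
    ("SYN_RECV", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",      # active closer sends FIN
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",   # passive closer acks the FIN
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",       # both sides closed at once
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",
    ("CLOSING", "recv_ack"): "TIME_WAIT",
    ("CLOSE_WAIT", "close"): "LAST_ACK",         # passive closer sends its FIN
    ("LAST_ACK", "recv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout"): "CLOSED",          # 2 * MSL elapsed
}

def walk(start: str, events: list[str]) -> list[str]:
    """Apply events in order and return the states visited."""
    state, path = start, []
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path

client_path = walk("CLOSED", ["active_open", "recv_syn_ack", "close",
                              "recv_ack", "recv_fin", "timeout"])
server_path = walk("CLOSED", ["passive_open", "recv_syn", "recv_ack",
                              "recv_fin", "close", "recv_ack"])
print(client_path)
print(server_path)
```

The two walks correspond to the handshake and teardown diagrams earlier: the client is the active opener and active closer, and the server is passive for both.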

Variants#

  • TCP with selective acknowledgement (SACK) — receiver acks non-contiguous ranges ("got 0-100 and 200-300, missing 100-200"), letting the sender retransmit only the gaps. Negotiated with the SACK-permitted option at handshake and enabled by default on all mainstream operating systems.
  • TCP Fast Open (TFO) — carries data in the SYN itself using a cookie from a prior connection. Saves an RTT for repeat clients. Adoption is patchy due to middlebox interference.
  • TCP keepalives — periodic probes to detect dead peers. Off by default; applications enable per-socket. Default intervals are absurdly long (2 hours on Linux) — tune them.
  • Multipath TCP (MPTCP) — splits one TCP connection across multiple paths (for example Wi-Fi and cellular) so the connection survives a switch between them. Apple has used it on iOS for Siri since iOS 7.
  • TCP BBR — Google’s bottleneck-bandwidth congestion-control algorithm. Doesn’t treat loss as the only congestion signal — measures bandwidth and RTT directly. Now a default on YouTube and parts of Google’s edge.
  • QUIC — strictly speaking not a TCP variant, but the practical successor. UDP-based, integrates TLS 1.3, per-stream loss recovery. Carries HTTP/3.
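
For the keepalive item above, per-socket tuning might look like the sketch below. The helper name and the timing values are illustrative assumptions, not recommendations, and the three timing options are platform-specific, so the code sets each one only if the running Python build exposes it.

```python
import socket

def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 5) -> None:
    """Turn on keepalive probes; the default values here are illustrative only."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # idle: seconds of silence before the first probe
    # interval: seconds between probes
    # count: unanswered probes before the connection is declared dead
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)   # not every platform defines all three
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print(keepalive_on)
sock.close()
```

With these example values, a silent peer would be detected after roughly idle + interval * count seconds instead of the two-hour system default.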

Trade-offs#

TCP — strengths: reliable in-order byte stream, automatic congestion handling, ubiquitous, NAT-friendly, well-tooled. Costs: connection setup takes 1 RTT, a lost segment causes head-of-line blocking across all streams multiplexed on one connection, and TIME_WAIT can exhaust ephemeral ports on high-fanout clients.
UDP — strengths: zero connection setup, no head-of-line blocking, message-oriented. Costs: you write reliability, ordering, and congestion control yourself (or use QUIC), and some firewalls and load balancers are UDP-hostile.

Other tensions inside TCP itself:

  • Reliability cost. Every byte must be ack’d; lost segments must be retransmitted; the OS must buffer in-flight data on both sides. Memory and bandwidth aren’t free — tcp_rmem / tcp_wmem tuning is real work at scale.
  • In-order delivery. A single lost packet blocks every later byte on the stream until retransmission completes. Painful for low-latency streaming inside one TCP connection. HTTP/2 over TCP inherits this; HTTP/3 over QUIC dodges it.
  • Slow start. Each new TCP connection starts conservatively and ramps up. For short-lived connections (one HTTP/1.1 request), slow start dominates the timing. Keep-alive and HTTP/2 multiplexing both attack this.
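  • Random initial sequence numbers. They prevent off-path attackers from guessing valid sequence numbers and injecting segments. The cost shows up with middleboxes: some firewalls and NATs rewrite sequence numbers without rewriting the SACK option to match, which breaks SACK, and option-mangling middleboxes can also disturb timestamps.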

Why the handshake is three-way and not two

Two messages (client SYN, server SYN-ACK) only confirm one direction. The third message (client ACK) is what tells the server “I received your SYN-ACK; we agree on starting sequences both ways”. Without it, a stale duplicate SYN from an old connection could trick the server into committing to a session no client is actually opening. The third ACK is the cheap insurance against that.

Common pitfalls#

  • Treating TCP as message-oriented. It is a byte stream. Two write() calls of 100 bytes each may arrive as one 200-byte read or two 100-byte reads or 50+150 — you must frame messages yourself (length prefix, delimiter, or fixed-size).
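  • Assuming close() means delivered. close() returns once your data is queued in the kernel send buffer, not once the peer has received it. If the peer stops reading, the queued data and the FIN wait behind a closed receive window and the socket lingers in FIN_WAIT_1; on Linux, closing with unread data still in your own receive buffer sends RST instead of FIN. SO_LINGER changes how close() behaves; the defaults are usually fine, but only an application-level acknowledgement proves the peer processed the data.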
  • Confusing flow control with congestion control. Flow control is the receive-window protecting the receiver. Congestion control is cwnd protecting the network. They’re independent mechanisms; the sender uses min(rwnd, cwnd).
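  • Ignoring Nagle and delayed-ACK interaction. Nagle holds back small sends while earlier data is unacknowledged; delayed ACK holds back acks in the hope of piggybacking them on data. Together they can add about 40 ms of latency on Linux (up to 200 ms on some other stacks) to write-write-read patterns. Set TCP_NODELAY to disable Nagle for latency-sensitive traffic such as RPCs and SSH.
  • Forgetting that RST is asynchronous. When a peer process is killed, its kernel sends FIN or RST on its behalf; when the peer host itself crashes or loses power, nothing is sent at all, and an RST only appears after it reboots and receives your next segment. Either way, the local side may not learn until a later read or write fails with ECONNRESET (or EPIPE, or a timeout). Connection-pool health checks help.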
  • Tuning kernel knobs cargo-cult-style. net.core.somaxconn, tcp_max_syn_backlog, tcp_fin_timeout, tcp_keepalive_time: measure before changing. Defaults are reasonable on modern Linux.
  • Believing the RTT estimator immediately. Karn’s algorithm ignores RTT samples from retransmitted segments. Early in a connection, the smoothed RTT is noisy — initial retransmission timeouts may fire too eagerly.
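
For the first pitfall in the list above, length-prefix framing is one common remedy. The sketch below assumes a 4-byte big-endian length header; the helper names are made up for this example, and socket.socketpair() stands in for a real TCP connection so the demo runs without a network.

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one message as a 4-byte big-endian length followed by the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        buf += chunk
    return bytes(buf)

def recv_msg(sock: socket.socket) -> bytes:
    """Read one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Demo over a connected stream-socket pair (a stand-in for a TCP connection).
left, right = socket.socketpair()
send_msg(left, b"first")
send_msg(left, b"second message")
messages = [recv_msg(right), recv_msg(right)]
print(messages)
left.close()
right.close()
```

The receiver recovers the two messages regardless of how the bytes were split or merged in transit, because it reads by declared length rather than by recv() boundaries. A production version would also cap the accepted length to avoid allocating on a corrupt or hostile header.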