Distributed Systems — Communication Basics — Operating Systems

Summary#

Distributed communication starts from one uncomfortable fact: the network is unreliable. Packets get dropped, duplicated, reordered, and delayed by arbitrary amounts. The machine on the other end can crash, partition away, or sit silently with a queue full. Nothing in the lower layers (Ethernet, IP, even TCP) saves you — TCP papers over the unreliability for one connection but says nothing about what happens when a connection dies mid-request.

Everything else in distributed systems — RPC, replication, consensus, distributed file systems, distributed transactions — is layered scaffolding on top of this one fact. The patterns you reach for repeatedly are timeouts, retries, idempotent operations, exactly-once / at-most-once / at-least-once semantics, and (often) a stateless protocol so the recovery story stays tractable.

Why it matters#

Local function calls always succeed or throw. Remote calls have a third outcome: no response. That single asymmetry breaks the abstraction that distributed objects can be treated like local objects (the “RPC is just a function call” school, debunked by Waldo et al. 1994 in A Note on Distributed Computing). When a client times out waiting for a response, it cannot distinguish:

The request never reached the server.
The server processed it and crashed before replying.
The server processed it, replied, and the reply was lost.
The server is alive, the reply is in flight, the timeout just fired too early.

Different recovery actions are correct for each case, but the client cannot tell them apart. The systems that work assume the worst and design for retries to be safe.

How it works#

Layers of unreliability#

Datagram (UDP) — fire and forget. No ordering, no delivery, no duplicate detection. Cheap.
Reliable byte stream (TCP) — in-order, no duplicates on one connection. Says nothing about endpoint crashes, application-level message framing, or what happens when the connection itself dies.
RPC / request-response — application-level “send a message, get a reply.” Layered on top of either UDP (NFS v2/v3) or TCP (gRPC, HTTP/2). This is where retries and idempotency become the operator’s problem.

TCP is good enough for stream-of-bytes between two endpoints. The moment you have request semantics or multiple endpoints, you’re back to first principles.

The four delivery semantics#

exactly-once   — guaranteed to execute once, no more, no less.  (impossible in general)
at-least-once  — executes one or more times; retries until acked. Requires idempotent ops.
at-most-once   — executes zero or one times; no retry. Fast, but loses requests under failure.
best-effort    — no guarantee.                                    (UDP for telemetry)

True exactly-once is impossible over an asynchronous network (the FLP-impossibility-adjacent result). What systems call “exactly-once” is really “at-least-once delivery + idempotent processing + deduplication.” Kafka exactly-once, Stripe idempotency keys, AWS SQS “exactly-once” — all variations on this theme.

Idempotency as the central virtue#

An operation is idempotent if doing it twice has the same effect as doing it once. PUT /users/123 { name: "Alice" } is idempotent. POST /users { name: "Alice" } (creates a new user with a fresh ID) is not. withdraw($100) is not. set_balance($500) is.

Idempotency lets you retry without thinking. If you design every RPC to be idempotent, the recovery story collapses to a loop:

loop {
    send request
    if timeout: continue   // safe to retry
    if reply received: done
}

When the underlying operation isn’t naturally idempotent, you make it so by attaching a client-generated request ID and tracking which IDs the server has already processed. Stripe famously requires an Idempotency-Key header on POST /charges — if the same key comes in twice, the second call returns the first call’s result.

Timeouts and retries#

A request without a timeout will eventually be wedged forever on a broken connection. Every RPC needs a deadline. The trade-off: too short, you retry requests that would have completed; too long, you keep a thread/connection tied up waiting for a server that’s gone.

Common patterns:

Bounded retry with exponential backoff — wait 100 ms, 200 ms, 400 ms, 800 ms, … up to a cap. Adds jitter (±50% randomization) to avoid synchronized thundering herds.
Hedged requests — after the median latency elapses, fire a second request to a different replica; use whichever returns first. Trades extra load for tail-latency reduction.
Circuit breakers — after N consecutive failures, stop sending to that target for a window. Lets a struggling backend recover instead of getting hammered.

RPC frameworks#

The textbook RPC stack:

client stub -> marshal args -> send over wire -> unmarshal on server
                                                       -> dispatch to handler
                                                       -> marshal reply
                                            <- unmarshal reply -> return to caller

Real-world incarnations: Sun RPC (NFS), DCE RPC (Windows), CORBA (failed), Java RMI (failed), Thrift (Facebook), gRPC (Google, on HTTP/2 + Protobuf), Cap’n Proto, JSON-RPC. The wire format changes; the principles don’t.

Stateless vs. stateful protocols#

A stateless protocol carries everything the server needs in each request — no client-side memory on the server. Restart the server mid-session and the next request still works. The textbook example is NFS v2/v3: every read carries (file_handle, offset, length); the server replies; nobody remembers anything across calls. This is what makes NFS robust to server reboot — clients just retry.

A stateful protocol keeps session state on the server (open file descriptors, sequence numbers, in-flight transactions). Faster (no need to re-establish state per call) but the failure story is harder — the server losing state means clients see “session expired” or worse.

Variants and trade-offs#

Stateless + idempotent — server holds no per-client memory; every request is self-contained and safe to retry. Pioneered by NFS; lives on in REST, S3, most cloud APIs. Robust to server reboot; expensive when state per request is large; cache consistency becomes the client’s problem.

Stateful + reliable transport — server tracks per-client session; client sends deltas; server holds open files, write locks, transactions. Faster per request when state is large; failure recovery requires session migration or re-establishment. AFS callbacks, SMB, modern databases.

Other axes:

Push vs. pull. Pull (poll) is simple but wasteful and laggy. Push (callbacks, long-poll, server-sent events, WebSockets) is efficient but requires the server to track who to notify — extra state.
Synchronous vs. asynchronous RPC. Sync blocks the caller; async takes a callback or returns a future. Async is essential when one logical operation involves many remote calls.
One server vs. many. Once you have replicas, you need a strategy: pick one, fan out to all, quorum to some. Each has its own consistency and latency profile.

Failure detection#

You can’t tell the difference between “slow” and “dead” without a timeout, and you can’t tune the timeout perfectly. Practical detectors:

Heartbeats — ping every T seconds; declare dead after K missed pings. Easy; lots of false positives under load.
Phi-accrual — adaptive heartbeat that returns a “suspicion level” rather than a binary live/dead. Cassandra uses this.
External coordinator — Zookeeper / etcd / Consul holds ephemeral session nodes; if the session expires, the node is dead. Centralises the detection problem.

When this is asked in interviews#

This concept underlies almost every distributed-systems interview. Patterns:

Foundational — “What’s the difference between at-least-once and at-most-once?” Tests vocabulary. Bonus: “Why is exactly-once impossible?”
Mid-level — “Design an API that survives client retries safely.” Answer: idempotency key, request ID dedup, version vector.
Senior — “How do you tune timeouts in a fan-out service?” Tests understanding of tail latency, hedging, circuit breakers.
Staff — “Walk me through what happens when a client times out a write to a quorum-replicated database.” Tests the whole quorum + retry + dedup interplay.

A second context: post-incident reviews. “Did this retry storm make the outage worse?” Yes, almost always — unbounded retries amplify failure into outage. The right framing is “every retry needs backoff, jitter, and a circuit breaker.”

Why does TCP not solve this problem?

TCP guarantees in-order, no-duplicate delivery as long as the connection stays open. When the connection dies — peer crashes, switch reboots, NAT timeout — TCP can’t tell you whether your last write made it. The receiver-side application may have processed it; or may have crashed mid-processing; or may have crashed before reading it from the socket buffer. The retransmit window only covers bytes the OS hasn’t acked; it doesn’t cover application-level acknowledgement. Any RPC layered on TCP still needs an application-level idempotency story.

NFS — Network File System — stateless RPC over UDP, the canonical example.
AFS — Andrew File System — the stateful alternative with callbacks.
Concurrency Bugs — Deadlock, Atomicity, Order — distributed bugs are concurrency bugs with worse failure modes.
Threads and Shared State — the local-machine companion problem.
POSIX Threads API — what you reach for inside each node.