Abstractions in Distributed Systems — System Design

Summary#

Distributed-systems vocabulary is overloaded. The same five or six words — node, process, service, instance, replica, cluster — mean subtly different things to different interviewers, and the wrong-sized abstraction makes the rest of the design wrong. This page nails down the unit boundaries you’ll be drawing on a whiteboard and the failure scope that comes with each one.

Why it matters#

Pick the wrong abstraction and your trade-offs are about the wrong thing. A candidate who says “I’ll deploy three replicas” without specifying whether those replicas are processes on one host, containers in one zone, or VMs in three regions has not actually answered the availability question — they’ve only used the word.

Sharper interviewers will deliberately use a fuzzy word (“we have a worker”) and watch whether the candidate asks back. The clarification — “worker process, worker host, or worker pool?” — is itself the signal.

How it works#

Walk the ladder from smallest unit to largest. Every step up is a new failure domain, a new latency floor, and a new coordination cost.

Process#

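A single OS process. Its threads share one address space, and it is usually reachable at one IP address and one or more ports. At interview altitude a crash is treated as atomic: the process is either up or down, although a hung or very slow process is a real third case. The failure domain is “this process”. Within a process you have threads, but threads are usually invisible at interview altitude.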

Node (host / machine / VM)#

One physical or virtual machine running one or more processes. Adds the hardware failure domain: disk, NIC, kernel panic, power. A “node” in a database paper almost always means one host running one database process — be careful when a paper says “3 nodes” and you’ve drawn 3 processes on one host.
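A minimal sketch of the “3 nodes” check, assuming a hypothetical deployment map from process name to host name (`pg-1`, `host-a` and the rest are made up for illustration). It counts hosts rather than processes, since the host is the hardware failure domain:

```python
def node_count(process_to_host):
    """Number of distinct hosts, i.e. hardware failure domains, in use."""
    return len(set(process_to_host.values()))


def lost_on_host_failure(process_to_host, failed_host):
    """Processes that go down together when one host fails."""
    return sorted(p for p, h in process_to_host.items() if h == failed_host)


# Hypothetical layouts: three database processes in both cases.
stacked = {"pg-1": "host-a", "pg-2": "host-a", "pg-3": "host-a"}
spread = {"pg-1": "host-a", "pg-2": "host-b", "pg-3": "host-c"}

print(node_count(stacked), lost_on_host_failure(stacked, "host-a"))
print(node_count(spread), lost_on_host_failure(spread, "host-a"))
```

Both layouts have three processes, but only `spread` has three nodes in the sense a database paper would mean.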

Service#

A logically named group of identical processes behind a stable address (DNS name, virtual IP, mesh identity). The user of a service does not know how many processes are behind it. The service contract is the API; the implementation is replaceable.
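One way to picture the stable-address idea is a toy `Service` class, sketched here under assumptions: the class, the `orders.internal` name, and the round-robin policy are all hypothetical, and real load balancers or meshes behave differently. The point is only that the caller holds the name, never the list of processes:

```python
import itertools


class Service:
    """A stable name in front of a replaceable set of backend processes."""

    def __init__(self, name, backends):
        self.name = name
        self.set_backends(backends)

    def set_backends(self, backends):
        # Callers never see this list; it can grow, shrink, or be replaced.
        self._rotation = itertools.cycle(list(backends))

    def call(self, request):
        # Round-robin is an assumption here; real balancers vary.
        backend = next(self._rotation)
        return backend(request)


def make_backend(process_id):
    # Each backend stands in for one identical process.
    return lambda request: f"{process_id} handled {request}"


orders = Service("orders.internal", [make_backend("p1"), make_backend("p2")])
first = orders.call("GET /orders/7")
orders.set_backends([make_backend("p9")])  # redeploy behind the same name
second = orders.call("GET /orders/7")
```

The caller's code is identical before and after the redeploy; only the process that answers has changed.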

Replica#

A copy of state for redundancy. Replicas are deliberately not independent — they hold the same data, so a logical bug propagates to all of them. Three replicas of the same Postgres in the same zone tolerate one machine failure, not one bug.
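The difference between a machine failure and a logical bug can be sketched with three in-memory copies of a key-value store. This is an illustration only: the operation format and the key `balance:42` are assumptions, and real replication ships a write-ahead log rather than Python tuples.

```python
def apply(state, op):
    """Apply one replicated operation to a replica's key-value state."""
    kind, key, value = op
    if kind == "put":
        state[key] = value
    elif kind == "delete":
        state.pop(key, None)


def replicate(replicas, op):
    # Every replica applies the same operation, good or bad.
    for state in replicas:
        apply(state, op)


replicas = [{}, {}, {}]
replicate(replicas, ("put", "balance:42", 100))

# Machine failure: one copy disappears, the data survives on the others.
survivors = replicas[:2]
after_machine_failure = [s.get("balance:42") for s in survivors]

# Logical bug: a bad delete is faithfully replicated to every copy.
replicate(replicas, ("delete", "balance:42", None))
after_bug = [s.get("balance:42") for s in replicas]
```

After the machine failure the two survivors still hold the value; after the buggy delete no replica does, which is why replication is not a substitute for backups.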

Shard / Partition#

A horizontal slice of data, usually identified by a key range or hash. Each shard is independently writable. A shard is a scale unit; a replica is a redundancy unit. Interview answers often conflate them.
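The two axes can be kept apart in a small sketch. It assumes hash-modulo routing, which is only one of several schemes, and the names `shard_for` and `build_layout` are hypothetical:

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_layout(num_shards, replicas_per_shard):
    """Scale axis = shards, redundancy axis = replicas within each shard."""
    return {
        shard: [f"shard{shard}-replica{r}" for r in range(replicas_per_shard)]
        for shard in range(num_shards)
    }


layout = build_layout(num_shards=4, replicas_per_shard=3)
owner = shard_for("user:1001", 4)
copies = layout[owner]  # all three hold the same slice of data
```

Adding shards increases write capacity; adding replicas per shard increases redundancy. Note that plain modulo routing remaps most keys when `num_shards` changes, which is one reason real systems often use key ranges or consistent hashing instead.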

Cluster#

A managed group of nodes that coordinate (consensus, gossip, leader election). “Cluster” usually implies an internal membership protocol — Kafka cluster, Cassandra ring, Kubernetes cluster.
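For consensus-style clusters, the membership protocol usually comes down to a majority rule. The sketch below assumes a fixed, known member list and a hypothetical `has_quorum` helper; gossip-based systems handle membership differently.

```python
def has_quorum(members, reachable):
    """True if this side of a partition sees a strict majority of members."""
    visible = set(members) & set(reachable)
    return len(visible) > len(members) // 2


members = ["n1", "n2", "n3", "n4", "n5"]

# A partition splits the cluster 3 / 2.
majority_side = ["n1", "n2", "n3"]
minority_side = ["n4", "n5"]

can_elect_majority = has_quorum(members, majority_side)
can_elect_minority = has_quorum(members, minority_side)
```

Only the three-node side can elect a leader. If the old leader sits on the two-node side, it has to step down, which is the usual answer to a partitioned-leader question.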

Zone / Region#

Physical isolation domains. A zone is one or more datacenters with independent power, cooling, and network; a region is a geographic grouping of zones (independent fiber paths, sometimes independent legal jurisdictions). Cross-zone round trips typically take from under a millisecond to a couple of milliseconds; cross-region round trips are commonly 30–150 ms, and more for the most distant pairs.
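The latency floor becomes concrete with a little arithmetic. The round-trip figures below are illustrative assumptions rather than measurements, and the two-round-trip write path is a hypothetical example:

```python
# Illustrative round-trip times in milliseconds; assumptions, not measurements.
RTT_MS = {"same_zone": 0.2, "cross_zone": 1.0, "cross_region": 80.0}


def path_latency_ms(hops):
    """Sum the round trips of a request that makes its hops one after another."""
    return sum(RTT_MS[scope] * count for scope, count in hops)


# A write that consults a lock service and then commits to a quorum:
# two sequential round trips, placed either across zones or across regions.
zonal = path_latency_ms([("cross_zone", 2)])
regional = path_latency_ms([("cross_region", 2)])
```

Under these assumptions the same protocol costs about 2 ms across zones and 160 ms across regions, before any processing time. That gap is why synchronous coordination is normally kept within a region.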

Variants and trade-offs#

Coarse abstractions (one box per service) — fast to draw, easy to discuss at the architecture level, but they hide the question of “how many of these, where”. Good for the first 5 minutes of high-level design.

Fine abstractions (process / node / zone explicit) — required for any availability, consistency, or failover discussion. Slow to draw and easy to over-clutter, so only zoom in where the failure mode matters.

Logical vs physical is the other axis. A “primary database” is a logical role: at any moment at most one process should hold it, and during a failover there may briefly be none. The physical primary changes during failover. Confusing the two leads to claims like “we lose data when the primary dies” that aren’t quite right (you lose data only if writes were acknowledged before they were replicated) and aren’t quite wrong either.
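The data-loss condition can be sketched with a toy primary that acknowledges writes either before or after the replica has them. The `Primary` class and its methods are hypothetical and model one replica only:

```python
class Primary:
    """Holds the 'primary' role and ships its log to one replica."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.log = []            # writes acknowledged to clients
        self.replica_log = []    # writes the replica has received

    def write(self, record):
        self.log.append(record)
        if self.synchronous:
            # Acknowledge only after the replica has the record.
            self.replica_log.append(record)
        return "ack"

    def ship_pending(self):
        # Asynchronous mode: replication catches up some time later.
        self.replica_log = list(self.log)


def lost_on_failover(primary):
    """Acknowledged writes missing from the replica that takes over."""
    return [r for r in primary.log if r not in primary.replica_log]


async_primary = Primary(synchronous=False)
async_primary.write("w1")
async_primary.ship_pending()
async_primary.write("w2")          # acknowledged, not yet shipped
lost_async = lost_on_failover(async_primary)

sync_primary = Primary(synchronous=True)
sync_primary.write("w1")
sync_primary.write("w2")
lost_sync = lost_on_failover(sync_primary)
```

In the asynchronous case the promoted replica is missing `w2`, a write the client was told had succeeded. In the synchronous case nothing acknowledged is lost, at the cost of a replica round trip on every write.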

The 'stateless service' trick

Senior candidates almost always describe application tiers as “stateless services” without qualifying what state means. Connection pools, in-memory caches, sticky-session affinity, and JWT verification keys are all state. A truly stateless service is rarer than the term suggests — but the abstraction is useful because it tells the interviewer “I’m asserting this layer is horizontally scalable without coordination”.

When this is asked in interviews#

Rarely as a direct question. Almost always as a follow-up trap. The interviewer asks “what happens when the database goes down” and the candidate says “we fail over to the replica”. The trap is: which replica, in which zone, with what lag, and who decides? If the candidate has been sloppy about abstractions earlier in the interview, this is where it surfaces.
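Three of those four questions (which replica, which zone, what lag) can be written down as a selection rule. This is a sketch under assumptions: the `Replica` fields, the zone names, and the lag threshold are hypothetical, and it says nothing about who runs the rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    name: str
    zone: str
    lag_seconds: float


def pick_failover_target(replicas, failed_zone, max_lag_seconds) -> Optional[Replica]:
    """Least-lagged replica outside the failed zone, or None if all lag too much."""
    candidates = [
        r for r in replicas
        if r.zone != failed_zone and r.lag_seconds <= max_lag_seconds
    ]
    return min(candidates, key=lambda r: r.lag_seconds, default=None)


replicas = [
    Replica("db-b", zone="zone-1", lag_seconds=0.1),  # same zone as the primary
    Replica("db-c", zone="zone-2", lag_seconds=2.5),
    Replica("db-d", zone="zone-3", lag_seconds=40.0),
]

target = pick_failover_target(replicas, failed_zone="zone-1", max_lag_seconds=5.0)
strict = pick_failover_target(replicas, failed_zone="zone-1", max_lag_seconds=1.0)
```

With a 5-second tolerance the rule promotes `db-c`; with a 1-second tolerance it returns `None`, so the system has to choose between staying down and accepting data loss. The “who decides” question is about where this function runs: an operator, an orchestrator, or the cluster's own consensus group.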

Most common at the SDE-2 → senior boundary at FAANG-equivalents and at infrastructure-heavy shops (Stripe, Cloudflare, Databricks, Datadog). Platform-team interviews care more than product-team interviews: a payments platform engineer is expected to distinguish a region from an availability zone without prompting; a feature-team frontend engineer often isn’t.

Common follow-ups:

“How many nodes per zone? Per region? Why those numbers?”
“If the zone holding the leader is partitioned, what happens?”
“Is this service truly stateless, or is there state you’re not drawing?”
“What’s the smallest unit that can fail independently here?”
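The last follow-up can be answered mechanically once placement is explicit. The sketch below assumes a majority-quorum service, globally unique host, zone, and region names, and a hypothetical three-instance placement:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    host: str
    zone: str
    region: str


def survives_any_single_failure(instances, level):
    """True if a strict majority stays up when any one domain at `level` fails."""
    domains = {getattr(i, level) for i in instances}
    for failed in domains:
        alive = [i for i in instances if getattr(i, level) != failed]
        if len(alive) <= len(instances) // 2:
            return False
    return True


# Hypothetical placement: three instances, three hosts, two zones, one region.
instances = [
    Instance("h1", "zone-a", "region-x"),
    Instance("h2", "zone-a", "region-x"),
    Instance("h3", "zone-b", "region-x"),
]

report = {lvl: survives_any_single_failure(instances, lvl)
          for lvl in ("host", "zone", "region")}
```

For this placement the report says the design survives any single host failure but not the loss of `zone-a` or of the region, so the honest answer to “three replicas” here is “host-level redundancy only”.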