Reliability — System Design · Engineering Playbook

Summary

Reliability is the probability that the system does the right thing over a period of time, under expected operating conditions. It asks for more than availability does: a system that’s up but returning wrong answers is available, not reliable. The reliability question is “when this serves a request, is the answer correct, complete, and durable?”

Why it matters

Candidates conflate availability and reliability constantly, and senior interviewers wait for it. “Highly available” describes uptime; “highly reliable” describes whether what came back was right. A read replica serving stale-by-five-minutes data is available — you got a response — but unreliable for any workflow that needs current state.

The cost shapes are also different. Adding redundancy raises availability cheaply. Raising reliability often means adding work — checksums, durable acks, idempotency tracking, end-to-end verification — which costs latency and throughput on every request, not just during incidents.

How it works

Three sub-dimensions, each with its own knob:

Correctness

Does the system return the right answer for the inputs given? Correctness failures look like: arithmetic bugs, race conditions, stale-read responses to “show me my latest order”, phantom writes, lost updates. Defended by: tests, formal methods, idempotency keys, transactions, reconciliation jobs, end-to-end auditing.

Durability

Once the system has acknowledged a write, the write survives. Durability failures look like: data loss after a crash before fsync, acknowledged writes lost when a failover promotes a lagging replica, accidental deletes, corrupt backups. Quoted in nines, same as availability: “11 nines” (99.999999999%) is the standard cloud-object-store claim. Durability and availability are orthogonal: a system can be 99.999999999% durable but offline for an hour during a failover.
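The "crash before fsync" failure comes down to ordering: the acknowledgement must come after the flush to stable storage, not before. The following is a minimal sketch of that ordering using Python's `os.fsync`; the names `durable_append`, `handle_write`, and the log file name are hypothetical, and a real store would also fsync the parent directory after creating the file and would depend on the storage hardware honoring flushes.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> None:
    """Append one record and return only after it has been flushed to disk."""
    # O_APPEND makes every write land at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record + b"\n")
        # Without fsync the bytes may still sit in the OS page cache,
        # so a crash right after the ack could lose them.
        os.fsync(fd)
    finally:
        os.close(fd)


def handle_write(path: str, record: bytes) -> str:
    durable_append(path, record)
    return "ack"  # acknowledge only after the data is durable


# Usage: write to a throwaway directory (hypothetical log name).
log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
print(handle_write(log_path, b"order-1"))
```

The fsync on every write is exactly the per-request latency cost described earlier; batching several records per fsync is a common way to amortize it.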

Continuity

The system keeps working across failures, not just up to them. Continuity failures look like: cache stampede after a restart, half-committed transactions after a partition, “ghost” sessions after an auth-service outage, in-flight messages lost during a queue migration. Defended by: idempotent operations and idempotency keys, durable journals, replay-on-startup, and effectively-once processing (at-least-once delivery plus deduplication).
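Replay-on-startup is easiest to see with a toy state machine. The sketch below assumes a journal of `(sequence_number, delta)` entries and a snapshot that remembers the last sequence number it reflects; `Snapshot` and `replay` are illustrative names, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    balance: int = 0
    last_seq: int = 0  # sequence number of the last journal entry applied


def replay(snapshot: Snapshot, journal: list[tuple[int, int]]) -> Snapshot:
    """Apply journal entries (seq, delta) newer than the snapshot, in order."""
    balance, last_seq = snapshot.balance, snapshot.last_seq
    for seq, delta in sorted(journal):
        if seq <= last_seq:
            continue  # already reflected in the snapshot
        balance += delta
        last_seq = seq
    return Snapshot(balance, last_seq)


# Usage: the process crashed after snapshotting entry 2 but before entry 3.
journal = [(1, 50), (2, -20), (3, 5)]
recovered = replay(Snapshot(balance=30, last_seq=2), journal)
print(recovered)
```

Because entries at or below `last_seq` are skipped, running the replay a second time leaves the state unchanged, which is what makes a crash during recovery itself safe.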

The MTBF / MTTR framing

MTBF (mean time between failures) — average operating time before a failure event. Bigger is better. Hardware MTBF is published; software MTBF is whatever your last quarter says.
MTTR (mean time to recover) — average time from failure detection to restored service. Smaller is better. Dominated by detection time, on-call response, and the runbook.
Availability ≈ MTBF / (MTBF + MTTR). Reliability is MTBF; availability also rewards small MTTR.

You can be very reliable (rare failures) with low availability (slow recovery) — and vice versa. A system that crashes once a month and stays down for 30 seconds has higher availability than one that runs flawlessly for a year and then takes 6 hours to recover.
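The comparison above can be checked with a few lines of arithmetic. This is an illustrative sketch only: it treats MTBF as the uptime between failures, takes a month as 30 days and a year as 365 days, and the function and variable names are assumptions.

```python
def availability(mtbf_seconds: float, mttr_seconds: float) -> float:
    """Availability ≈ MTBF / (MTBF + MTTR)."""
    return mtbf_seconds / (mtbf_seconds + mttr_seconds)


MONTH = 30 * 24 * 3600   # assumed month length, in seconds
YEAR = 365 * 24 * 3600   # assumed year length, in seconds

# Crashes once a month, recovers in 30 seconds.
frequent_fast = availability(MONTH, 30)
# Runs for a year, then takes 6 hours to recover.
rare_slow = availability(YEAR, 6 * 3600)

print(f"{frequent_fast:.4%} vs {rare_slow:.4%}")
```

Under these assumptions the monthly-crash system comes out at roughly 99.999% and the yearly one at roughly 99.93%: about six minutes of downtime a year against six hours.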

Variants and trade-offs

High availability, lower reliability — multi-replica read paths that serve stale data during partitions, eventually-consistent stores, retry-everything clients. Cheap, fast, simple. Acceptable for feeds, recommendations, counters, analytics.

High reliability, lower availability — single-leader writes with synchronous acks, refuse-on-uncertainty modes, two-phase commit. Slower, more expensive, simpler to reason about. Required for payments, inventory, identity.

The other axis is where reliability is enforced:

End-to-end at the application — idempotency keys on every mutation, dedup on the write path, reconciliation jobs that compare ledger against source-of-truth. Most defensive; most code to write.
In the data layer — transactions, unique constraints, foreign-key constraints, serializable isolation. Cheaper if your database supports it; pushes correctness down to a layer you trust.
At the network boundary — retry policies, circuit breakers, idempotency at the API gateway. Catches the easy 80%; doesn’t help with correctness bugs inside the service.
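Of the three enforcement points, the reconciliation job is the least familiar in code. A minimal sketch follows, assuming both the ledger and the source of truth can be read as a mapping from payment id to amount in cents; the `reconcile` function and the sample ids are hypothetical.

```python
def reconcile(ledger: dict[str, int], source: dict[str, int]) -> dict[str, list[str]]:
    """Compare two id -> amount_cents views and report every disagreement."""
    return {
        "missing_in_ledger": sorted(source.keys() - ledger.keys()),
        "unexpected_in_ledger": sorted(ledger.keys() - source.keys()),
        "amount_mismatch": sorted(
            k for k in ledger.keys() & source.keys() if ledger[k] != source[k]
        ),
    }


# Usage: p2 disagrees on amount, p3 never reached the ledger, p4 has no origin.
ledger = {"p1": 500, "p2": 700, "p4": 100}
source = {"p1": 500, "p2": 750, "p3": 300}
report = reconcile(ledger, source)
print(report)
```

A job like this does not prevent a bad write; it bounds how long one can go unnoticed, which is why it usually sits alongside idempotency keys and constraints instead of replacing them.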

Cloud object stores publish durability at 11 nines (10^-11 annual loss probability per object) but availability “only” at 4 nines. This is the canonical example of the orthogonality: their replication is brutally redundant, but the request path serving those objects has an ordinary amount of downtime.
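To put the two figures side by side, the sketch below turns each into a yearly expectation. The object count is an arbitrary illustration, the loss events are assumed independent, and the constant and function names are assumptions.

```python
ANNUAL_LOSS_PROBABILITY = 1e-11    # 11 nines of durability, per object
AVAILABILITY = 0.9999              # 4 nines
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_objects_lost_per_year(object_count: int) -> float:
    """Expected number of objects lost in a year, assuming independent losses."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def expected_downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes per year during which requests fail."""
    return (1 - availability) * MINUTES_PER_YEAR


lost = expected_objects_lost_per_year(10_000_000_000)          # 10 billion objects
downtime = expected_downtime_minutes_per_year(AVAILABILITY)
print(f"{lost:.2f} objects lost/year, {downtime:.1f} minutes down/year")
```

With ten billion stored objects this works out to about one lost object per decade, alongside a little under an hour of unavailability every year.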

Why exactly-once delivery is harder than it sounds

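No distributed messaging system can guarantee true exactly-once delivery across arbitrary failures: the two-generals problem shows that a sender can never be certain its last message or ack arrived, so it must either risk loss or resend. (FLP impossibility is often cited alongside it, but it concerns consensus in asynchronous systems, not delivery itself.) What real systems give is “at-least-once delivery + idempotent processing” or “transactional outbox + exactly-once consumer offsets”. Both rely on the consumer deduplicating, or committing its progress atomically with its side effects; neither makes the wire-level delivery itself exactly-once.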
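A small simulation makes the "at-least-once delivery + idempotent processing" combination concrete. Everything here is a toy: `IdempotentConsumer` and `deliver_at_least_once` are hypothetical names, the broker is a plain loop, and the dedup set lives in memory, whereas a real consumer would commit it in the same transaction as the side effect.

```python
class IdempotentConsumer:
    """Applies each message id at most once, however often it is delivered."""

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.total = 0

    def handle(self, message_id: str, amount: int) -> None:
        if message_id in self.seen:
            return  # duplicate delivery: the effect was already applied
        self.total += amount
        self.seen.add(message_id)


def deliver_at_least_once(
    consumer: IdempotentConsumer,
    messages: list[tuple[str, int]],
    lost_acks: set[str],
) -> int:
    """Deliver every message, resending those whose ack was lost; return delivery count."""
    deliveries = 0
    for message_id, amount in messages:
        consumer.handle(message_id, amount)
        deliveries += 1
        if message_id in lost_acks:
            # The broker never saw the ack, so it sends the message again.
            consumer.handle(message_id, amount)
            deliveries += 1
    return deliveries


# Usage: the ack for m2 is lost, so m2 arrives twice on the wire.
consumer = IdempotentConsumer()
deliveries = deliver_at_least_once(consumer, [("m1", 10), ("m2", 5)], {"m2"})
print(deliveries, consumer.total)
```

The wire still carries three deliveries for two messages; only the processing is effectively once.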

When this is asked in interviews

Three reliable triggers:

The durability question. “What’s your data-loss tolerance — RPO?” If the candidate’s design loses uncommitted writes during a primary failover and doesn’t acknowledge it, that’s a flag.
The retry / idempotency drill. “Your processPayment RPC times out. Now what?” Bad answer: blind retry, which can charge twice if the first attempt actually succeeded. Good answer: idempotency key, server-side dedup table with TTL, reconciliation job.
The “wrong but up” trap. “What if your replica is up but its data is corrupted?” Tests whether the candidate distinguishes availability from reliability and knows to fail closed.
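For the retry / idempotency drill, the server-side dedup table with TTL can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `IdempotencyStore`, `run_once`, and `charge_card` are hypothetical names, concurrent in-flight requests with the same key are not handled, and a production version would persist the key in the same transaction as the charge.

```python
import time
from typing import Callable


class IdempotencyStore:
    """Remembers the result of each keyed operation until its TTL expires."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: dict[str, tuple[float, str]] = {}  # key -> (expires_at, result)

    def run_once(self, key: str, operation: Callable[[], str]) -> str:
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # retry of a completed request: replay the stored result
        result = operation()
        self.entries[key] = (now + self.ttl, result)
        return result


charges: list[str] = []


def charge_card() -> str:
    """Stand-in for the real side effect; records that a charge happened."""
    charges.append("charged")
    return f"receipt-{len(charges)}"


# Usage: the client times out and retries with the same idempotency key.
store = IdempotencyStore(ttl_seconds=24 * 3600)
first = store.run_once("order-42", charge_card)
retry = store.run_once("order-42", charge_card)
print(first, retry, len(charges))
```

The TTL bounds the table's size, at the price that a retry arriving after expiry is treated as a new request; the reconciliation job is what catches that case.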

Asked most aggressively in any domain where a wrong answer is worse than no answer: payments, identity (Auth0, Okta), critical infrastructure, billing platforms, anything regulated. Expect it at senior-and-above bars everywhere.

Common follow-ups:

“What’s your RPO and RTO? How does the design enforce each?”
“Where does your system fail closed versus open, and why those choices?”
“Walk me through a scenario where the system is available but the answer is wrong. What catches it?”