Failure Models — System Design · Engineering Playbook

Summary#

A failure model is a statement of what kinds of failure your system promises to handle. Every protocol claim (“Raft tolerates one minority failure”) implicitly assumes a failure model — usually crash-stop or crash-recovery on nodes, and omission on the network. Get the model wrong and the claim is meaningless.

Why it matters#

The phrase “fault-tolerant” without a failure model is marketing. The interviewer probing for senior signal isn’t asking whether you handle failure — they’re asking which failures you’ve named and which you’ve quietly excluded.

The cost gradient is steep: crash-stop is cheap, Byzantine is exponentially expensive. Designs that assume the wrong model either over-engineer (paying Byzantine costs for crash-stop reality) or under-engineer (claiming Raft tolerates network failures it doesn’t).

How it works#

The classical hierarchy, weakest assumption first:

Crash-stop (fail-stop)#

Nodes either work correctly or stop permanently. No partial behaviour, no recovery. Easiest model to reason about — every node is either “in the quorum” or “out of it forever”. Real life is rarely this kind, but it’s the baseline assumption of most consensus protocol proofs.

Crash-recovery#

Nodes may crash and later restart with intact persistent state. The node may have lost everything in memory but disk survives. This is the realistic model for most servers. It’s what Raft, Paxos, ZooKeeper, and most production systems actually assume. The trick: when a node returns, the protocol must reintegrate it correctly without “rolling back” durably-committed state.

Omission#

Messages may be silently dropped — by the network, by an overloaded receiver, by a half-open connection. The sender doesn’t know if the message was lost or merely delayed. This is the network’s failure model, and it’s why every RPC needs a timeout, every protocol needs retries, and every retry needs an idempotency key.

Timing (asynchronous network)#

Messages eventually arrive but with no upper bound on delay. The FLP impossibility result lives here: in a fully asynchronous network with even one crash-prone node, no deterministic protocol can guarantee consensus. Production systems sidestep this by assuming partial synchrony — bounded delay most of the time, occasional unbounded delay during incidents.

Byzantine#

Nodes may behave arbitrarily — return wrong answers, lie, collude, send different messages to different peers. The strongest (worst) failure model. Needed for: blockchain, mutually-distrusting parties, security-critical multi-tenant systems. Costs 3f+1 nodes to tolerate f Byzantine failures (versus 2f+1 for crash-recovery), plus signed messages, plus complex view changes. Don’t assume Byzantine in an interview unless the prompt explicitly mentions adversarial parties.

Variants and trade-offs#

Weaker model assumed (crash-stop) — simpler design, cheaper coordination, fewer messages. Vulnerable if real failures are richer than the model — e.g. a “crashed” node that’s actually a slow node still serving stale reads through a cached client.

Stronger model assumed (Byzantine) — robust against arbitrary misbehaviour, but 1.5× the replica count and 5–10× the message complexity. Justified only when you actually cannot trust participants — public blockchains, federated systems with adversarial tenants.

The asymmetric trap: assume a model that’s too weak, and a real-world failure mode just isn’t handled — the system silently does the wrong thing. Assume one that’s too strong, and you pay forever. Both errors look fine until they don’t.

Most interview answers should land on crash-recovery for nodes + omission for the network + partial synchrony for timing. That’s the assumption stack of essentially every production database, every consensus log, every leader-elected service mesh.

Grey failures — slow, intermittent, half-working — are the genuinely hard case. A node that responds in 30 seconds when it usually responds in 30 ms hasn’t crashed, hasn’t omitted, isn’t lying. It’s just limping. None of the classical models cover this well, and most production outages live exactly here.

Why FLP doesn't kill consensus in practice

The Fischer-Lynch-Paterson impossibility says no deterministic asynchronous-network protocol can guarantee consensus with one crashed process. Production systems work around it by (a) assuming partial synchrony — the network is mostly well-behaved, occasionally not — and (b) introducing randomness (leader election timers, jitter). Consensus then becomes a probability bound rather than a guarantee: with high probability, the system makes progress.

When this is asked in interviews#

The keyword is “what happens when”. As soon as the interviewer asks “what happens when this fails”, they’re probing the candidate’s failure model. The signal is whether the candidate names the kind of failure before describing the response.

“What happens when the leader crashes?” → crash-recovery; election + log reconciliation.
“What happens when the network drops 5% of packets?” → omission at non-zero rate; retries, idempotency, possibly circuit breakers.
“What happens when the replica lies about its committed offset?” → not in scope unless prompt says so; this is Byzantine territory.

Heavily weighted at infrastructure-track interviews (database, storage, networking, distributed compute teams). Less common at product-team loops, though the AWS / Google / Meta L5+ bar will still expect it.

Common follow-ups:

“Your protocol says it tolerates f failures. f of what, exactly?”
“Distinguish a slow node from a failed node. How does your design tell them apart, and what happens if it gets it wrong?”
“Where are you assuming Byzantine resistance and where are you assuming honest participants?”