Fault Tolerance
Designing for the worst expected failure and degrading gracefully past it.
Summary
Fault tolerance is the system’s ability to keep operating — possibly in a degraded mode — when components fail. It’s defined relative to a fault budget: which failures you’ve committed to surviving, which you’ve committed to degrading through, and which you’ve explicitly declined to handle. “Fault tolerant” without that budget is hand-waving.
Why it matters
Every fault-tolerance claim is implicitly a claim about the failure model. The senior engineer’s move is to name the budget before designing for it: “tolerate one zone failure, degrade through one region failure, refuse to design for a global cloud outage.” That single sentence does more than ten paragraphs of generic redundancy talk.
The other reason it matters in interviews: graceful degradation is one of the cheapest signals to give. Saying “during a partial outage, we serve cached read-only results and disable writes” demonstrates more design thinking than any redundancy story.
How it works
The fault budget
Three categories of failure, each with a separate handling strategy:
- Tolerate. Failure is invisible to users. Examples: single node crash with health-check failover, single zone failure with multi-AZ replicas, transient network drop with retry. Budget: typical operating conditions, designed-for redundancy.
- Degrade. Failure is visible to users as reduced functionality but not as an outage. Examples: read-only mode during a database failover, cached / stale results during a regional issue, recommendations disabled when the ranker is down. Budget: rare but expected failures.
- Decline. Failure is explicitly not handled — the system goes down and recovery is operational. Examples: global cloud outage, multiple-region simultaneous loss, catastrophic data corruption. Budget: rare enough that the cost of resilience exceeds the value.
Every component should have a documented behaviour in each of the first two categories. The third is a business decision, not an engineering one.
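The three buckets can be written down as data, so that an unlisted failure shows up as a gap rather than an assumption. The following is a minimal sketch: the failure names and the `classify` helper are hypothetical, chosen to mirror the examples above.

```python
from enum import Enum


class Handling(Enum):
    TOLERATE = "tolerate"  # failure is invisible to users
    DEGRADE = "degrade"    # reduced functionality, but not an outage
    DECLINE = "decline"    # explicitly not handled; recovery is operational


# Hypothetical budget: the keys are illustrative failure names, not a standard taxonomy.
FAULT_BUDGET = {
    "single_node_crash": Handling.TOLERATE,
    "single_zone_failure": Handling.TOLERATE,
    "database_failover": Handling.DEGRADE,
    "ranker_down": Handling.DEGRADE,
    "global_cloud_outage": Handling.DECLINE,
}


def classify(failure: str) -> Handling:
    """Return the committed handling; an unlisted failure is a gap in the budget."""
    try:
        return FAULT_BUDGET[failure]
    except KeyError:
        raise KeyError(f"no documented behaviour for failure: {failure!r}") from None
```

Raising on an unknown failure is a deliberate choice in this sketch: it forces the budget to be extended explicitly instead of silently defaulting to one of the buckets.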
Tactics
- Redundancy. Multiple replicas, multiple zones, multiple regions. Buys “tolerate” for the corresponding scope of failure. Cost: N copies and the coordination among them.
- Bulkheads. Isolate failure domains. Different services on different pools so one service’s overload doesn’t starve another. Different connection pools per downstream so one slow dependency doesn’t tip the whole tier.
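- Circuit breakers. Stop calling a failing dependency for a cool-down window (breaker open) and serve a cached or default response instead. After the cool-down the breaker goes half-open and lets a probe request through: success closes it and restores full traffic, failure opens it again.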
- Timeouts. Every network call has a deadline. Leaving a client on its default of no timeout is one of the most common production bugs.
- Retries with backoff and jitter. Exponential backoff spaces out repeated attempts so a struggling dependency gets room to recover; jitter de-synchronizes retries that all started at the same moment, which is what prevents a thundering herd.
- Idempotency keys. Allow safe retries. Without them, retries multiply effects.
- Graceful degradation paths. Pre-decided “if X fails, serve Y” plans, exercised regularly in game-days.
- Backpressure. When downstream slows, propagate the slow-down upstream rather than queueing forever.
- Load shedding. When the system is overloaded, drop the lowest-priority requests first and protect the critical path.
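Of the tactics above, the circuit breaker has the most state to get right, so it is worth sketching. The version below is a minimal single-threaded illustration, not a production implementation: the class name, the `failure_threshold` and `cooldown_s` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Closed: calls pass. Open: calls are skipped. Half-open: a probe is allowed."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"  # cool-down elapsed: the next call is a probe
        return "open"

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()  # during the cool-down, do not touch the dependency
        try:
            result = fn()  # a normal call when closed, or the probe when half-open
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            return fallback()
        self.failures = 0
        self.opened_at = None  # any success closes the breaker
        return result
```

A call site would look like `breaker.call(fetch_live, serve_cached)`, where both callables are supplied by the caller. A concurrent version would also need to limit the half-open state to a single in-flight probe, which this sketch does not do.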
Graceful degradation patterns
The valuable design move is naming concrete degradation paths:
- Read-only mode. Writes refused; reads continue from replicas. Used during primary failover, schema migration, or write-path incident.
- Cached fallback. Serve last-known-good from a cache (or CDN) when the origin is down. Pairs with stale-while-revalidate.
- Feature-flag kill switches. Disable a feature path (recommendations, search, comments) at runtime when its backend is unhealthy. The rest of the product keeps working.
- Static fallback. Replace dynamic content with a static placeholder (“we’re having a moment, here’s the homepage”).
- Deferred work. Queue the request, acknowledge synchronously, process when the dependency recovers. Useful for non-immediate workflows (email, indexing, analytics).
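The cached-fallback pattern from the list above can be sketched as follows. This is an illustration under stated assumptions: `CachedFallback`, `max_stale_s`, and the `(value, degraded)` return shape are hypothetical names, and the behaviour is closer to "serve stale on error" than to full stale-while-revalidate, which would also refresh in the background.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFallback:
    """Serve fresh data when the origin answers, last-known-good when it does not."""

    def __init__(self, fetch: Callable[[str], str], max_stale_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch              # caller-supplied origin lookup
        self.max_stale_s = max_stale_s  # oldest cached value we are willing to serve
        self.clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, degraded); degraded=True means the value came from cache."""
        try:
            value = self.fetch(key)
        except Exception:
            entry = self._cache.get(key)
            if entry is not None and self.clock() - entry[0] <= self.max_stale_s:
                return entry[1], True
            raise  # nothing usable is cached: surface the original failure
        self._cache[key] = (self.clock(), value)
        return value, False
```

Returning the `degraded` flag lets the caller decide how to present the result, for example by showing a banner or disabling writes while stale data is on screen.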
Variants and trade-offs
The other axis is blast radius: how much is affected when one unit fails. How fault tolerant a design can be is set by the smallest unit that can fail independently:
- One process — restart is sub-second; impact is one tenant or one shard.
- One node — recovery via failover; impact is a slice of traffic.
- One zone — multi-AZ replicas absorb; impact is a region’s capacity drop.
- One region — multi-region active-active absorbs; impact is latency for users routed away.
- One provider / continent — multi-cloud or multi-CDN; rare design target.
Smaller blast radius → less code does fault handling → simpler design. Most blast-radius work is upstream — picking the right partitioning so failure isolates rather than cascades.
Cascading failure: the failure mode the design didn't plan for
Cascades happen when one component’s failure overloads its neighbours: a slow database makes app servers hold connections longer, so they run out of threads, so retries pile up, so the load balancer’s queue saturates, so health checks time out, so more app servers are marked unhealthy, so the remaining ones receive more traffic. The cure isn’t “more redundancy” — it’s backpressure and load shedding at every layer, so each layer protects itself when its dependency is slow.
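The "each layer protects itself" idea can be sketched as a bounded intake queue that rejects work instead of queueing forever, with some headroom held back for the critical path. The class name, the `reserved_for_critical` parameter, and the boolean return convention are assumptions for illustration.

```python
from collections import deque
from typing import Deque


class SheddingQueue:
    """Bounded queue that sheds low-priority work first and never grows unbounded."""

    def __init__(self, capacity: int, reserved_for_critical: int) -> None:
        if not 0 <= reserved_for_critical <= capacity:
            raise ValueError("reserved_for_critical must be between 0 and capacity")
        self.capacity = capacity
        self.reserved = reserved_for_critical
        self._items: Deque[str] = deque()
        self.shed_count = 0

    def offer(self, request: str, critical: bool = False) -> bool:
        """Admit the request if there is room for its priority class; else shed it."""
        # Non-critical requests may not use the slots reserved for critical ones.
        limit = self.capacity if critical else self.capacity - self.reserved
        if len(self._items) >= limit:
            self.shed_count += 1
            return False  # caller should fail fast, e.g. with a 429 or 503
        self._items.append(request)
        return True

    def take(self) -> str:
        """Remove and return the oldest admitted request."""
        return self._items.popleft()
```

A `False` return is the backpressure signal: the caller learns immediately that the layer is full and can back off or degrade, rather than holding a connection open while the request waits in an ever-growing queue.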
When this is asked in interviews
Three flavours of question:
- “What happens when X fails?” — for every component in the design. The expected answer is concrete: which alternative path, which degradation, which detection mechanism, which recovery time.
- “Walk me through a partial outage.” — pick a single failure (the cache, the primary region, the auth service) and trace the user experience. The strong candidate names a specific degraded behaviour, not “we’d handle it”.
- “What’s your fault budget?” — most explicit; rarer. The right answer is the three-bucket framing above.
Heavily weighted at infrastructure / SRE / platform-team interviews everywhere. Common in any senior+ loop. Less common at junior loops, where “have redundancy” is enough.
Common follow-ups:
- “Your cache tier dies entirely. What does the user see in the first 30 seconds? In the first 5 minutes?”
- “Your circuit breaker is open. What’s the user-visible behaviour, and when does it close?”
- “Where in this design does a retry storm originate, and what stops it?”
- “What’s the smallest failure that takes down the whole system? Can you shrink it?”
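For the retry-storm follow-up, one common answer is a bounded number of attempts with capped exponential backoff and full jitter. The sketch below assumes hypothetical parameter names (`attempts`, `base_s`, `cap_s`) and takes `sleep` and `rng` as arguments so the behaviour can be checked without real waiting.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_jitter(fn: Callable[[], T], attempts: int = 4, base_s: float = 0.1,
                      cap_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Call fn up to `attempts` times, sleeping a random, growing delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the last failure
            # Exponential ceiling, capped; "full jitter" picks uniformly below it.
            ceiling = min(cap_s, base_s * 2 ** attempt)
            sleep(rng.uniform(0.0, ceiling))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

The hard cap on attempts is what bounds the amplification: each caller multiplies load on the dependency by at most `attempts`. Retrying a non-idempotent operation this way is only safe when the request carries an idempotency key.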
Related concepts