Availability — System Design · Engineering Playbook

Summary#

Availability is the fraction of time a system is reachable and responding correctly. It’s commonly quoted in “nines” — 99.9%, 99.99% — but the right framing is as a budget: how much downtime you can spend per year on planned and unplanned events. Every extra nine costs roughly an order of magnitude more engineering and infrastructure.

Why it matters#

“Highly available” is a non-answer. The interviewer wants a target with a unit. Quoting 99.99% in a design conversation is implicitly committing to multi-zone redundancy, automatic failover, deploy strategies that don’t drop traffic, and a runbook that recovers in under five minutes. Quoting 99.9% is committing to roughly an order of magnitude less infrastructure.

The other thing the target controls: how much money you spend. The fourth nine costs ~3× the third; the fifth often costs another 5–10×. Picking a higher number than you need is the silent killer of architecture budgets.

How it works#

Convert nines to downtime per year by remembering one anchor: 99.9% ≈ 8.77 hours/year.

Target	Downtime/year	Downtime/month	Downtime/week
99% (two nines)	3.65 days	7.2 hours	1.68 hours
99.9% (three nines)	8.77 hours	43.8 minutes	10.1 minutes
99.99% (four nines)	52.6 minutes	4.4 minutes	1.01 minutes
99.999% (five nines)	5.26 minutes	26.3 seconds	6.05 seconds
99.9999% (six nines)	31.5 seconds	2.6 seconds	0.6 seconds

Redundancy strategies (in order of cost)#

Single instance. No redundancy; availability ≈ MTBF/(MTBF+MTTR). Hard to push past two nines.
Active-passive (hot standby). Standby replica kept warm; failover swaps traffic on detection. Buys you one nine over single-instance. Caveat: failover detection and DNS / load-balancer convergence takes seconds-to-minutes — that’s the floor on your recovery time.
Active-active within one region (multi-AZ). Every zone serves traffic; one zone failing degrades capacity, not availability. Standard pattern for three nines. Cross-AZ latency (~1–2 ms) is the cost.
Active-active across regions. Multi-region traffic, with global routing (Anycast, GeoDNS) sending users to the nearest healthy region. Required for four nines and above. Cost: cross-region replication, conflict handling, far more complex deployment.
Active-active across providers / continents. Multi-cloud failover. Buys resilience against single-provider outages. Costs explode — duplicated everything, distinct toolchains, conflicting consistency models.

Where the downtime actually comes from#

Two categories dominate, and they don’t add the way you’d expect:

Planned changes — deploys, schema migrations, runtime upgrades, certificate rotation. These are 60–80% of the downtime a user actually experiences at companies that haven’t invested in zero-downtime deployment.
Unplanned failures — hardware death, dependency outage, software bug, capacity exhaustion, mistaken operator command. These are what get postmortems written, but planned changes burn more nines.

Variants and trade-offs#

Single-region multi-AZ — covers ~98% of real-world failure modes (host failure, zone-level network blip, deploy issues). Sub-millisecond replication. Same legal jurisdiction. Three nines achievable, four nines hard.

Multi-region active-active — covers regional disasters, fiber cuts, provider-zone outages. Buys four to five nines. Costs: cross-region replication lag (50–150 ms), conflict handling, doubled infrastructure, vastly more deployment complexity.

The other variant axis is failover automation:

Manual failover. Human in the loop. Recovery time measured in 15–60 minute increments. Floor for availability is ~99.5%.
Automated failover with manual confirmation. System raises alert; on-call presses the button. 5–15 minutes. ~99.9%.
Fully automated failover (health-check-driven). Seconds. ~99.99% achievable. Risk: false-positive flips (a slow health check triggers failover into a worse state).

Internal-only services typically target three nines. User-facing consumer products target three to four. Payments, telco, ad-bidding target four to five. Air-traffic-control / pacemaker-equivalent target six — but those aren’t normal interview territory.

The math of dependent and parallel components

Two services in series (one calls the other): A_total = A_1 × A_2. Two services in parallel (request succeeds if either works): A_total = 1 - (1-A_1) × (1-A_2). Two 99% services in series ≈ 98%. Two 99% services in parallel ≈ 99.99%. This is why redundancy works and why long dependency chains are dangerous.

When this is asked in interviews#

In step 1 (NFR clarification), as a target. In step 7 (evaluation), as a critique. The strongest candidates volunteer a target — “let’s aim for 99.95% which is ~4.4 hours of downtime / year” — and tie it to specific design choices later.

Most aggressive at infrastructure / SRE-track interviews (Cloudflare, Fastly, AWS, GCP, Stripe, Coinbase), at on-call-heavy roles, and at any FAANG L5+ loop. Less aggressive at frontend or early-product loops.

Common follow-ups:

“Where does the next nine come from in your design?” — should point to the bottleneck dependency or the manual step.
“What’s the MTTR of your failover?” — should be a specific number, defended by what monitoring + automation backs it.
“If your dependency is 99.9% available, can your service ever exceed 99.9%?” — only if the dependency call can be optionally skipped (cache, graceful degradation, async fallback).
“Walk me through a deploy. Does the availability target survive it?” — tests whether the candidate knows zero-downtime deploy patterns or is quietly assuming maintenance windows.