Availability
Nines as a budget, redundancy strategies, failover modes, and the cost of each 9.
Summary#
Availability is the fraction of time a system is reachable and responding correctly. It’s commonly quoted in “nines” — 99.9%, 99.99% — but the right framing is as a budget: how much downtime you can spend per year on planned and unplanned events. Every extra nine costs roughly an order of magnitude more engineering and infrastructure.
Why it matters#
“Highly available” is a non-answer. The interviewer wants a target with a unit. Quoting 99.99% in a design conversation is implicitly committing to multi-zone redundancy, automatic failover, deploy strategies that don’t drop traffic, and a runbook that recovers in under five minutes. Quoting 99.9% is committing to roughly an order of magnitude less infrastructure.
The other thing the target controls: how much money you spend. The fourth nine costs ~3× the third; the fifth often costs another 5–10×. Picking a higher number than you need is the silent killer of architecture budgets.
How it works#
Convert nines to downtime per year by remembering one anchor: 99.9% ≈ 8.77 hours/year.
| Target | Downtime/year | Downtime/month | Downtime/week |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.2 hours | 1.68 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes | 10.1 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.4 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
| 99.9999% (six nines) | 31.5 seconds | 2.6 seconds | 0.6 seconds |
Redundancy strategies (in order of cost)#
- Single instance. No redundancy; availability ≈ MTBF/(MTBF+MTTR). Hard to push past two nines.
- Active-passive (hot standby). Standby replica kept warm; failover swaps traffic on detection. Buys you one nine over single-instance. Caveat: failover detection and DNS / load-balancer convergence takes seconds-to-minutes — that’s the floor on your recovery time.
- Active-active within one region (multi-AZ). Every zone serves traffic; one zone failing degrades capacity, not availability. Standard pattern for three nines. Cross-AZ latency (~1–2 ms) is the cost.
- Active-active across regions. Multi-region traffic, with global routing (Anycast, GeoDNS) sending users to the nearest healthy region. Required for four nines and above. Cost: cross-region replication, conflict handling, far more complex deployment.
- Active-active across providers / continents. Multi-cloud failover. Buys resilience against single-provider outages. Costs explode — duplicated everything, distinct toolchains, conflicting consistency models.
Where the downtime actually comes from#
Two categories dominate, and they don’t add the way you’d expect:
- Planned changes — deploys, schema migrations, runtime upgrades, certificate rotation. These are 60–80% of the downtime a user actually experiences at companies that haven’t invested in zero-downtime deployment.
- Unplanned failures — hardware death, dependency outage, software bug, capacity exhaustion, mistaken operator command. These are what get postmortems written, but planned changes burn more nines.
Variants and trade-offs#
The other variant axis is failover automation:
- Manual failover. Human in the loop. Recovery time measured in 15–60 minute increments. Floor for availability is ~99.5%.
- Automated failover with manual confirmation. System raises alert; on-call presses the button. 5–15 minutes. ~99.9%.
- Fully automated failover (health-check-driven). Seconds. ~99.99% achievable. Risk: false-positive flips (a slow health check triggers failover into a worse state).
Internal-only services typically target three nines. User-facing consumer products target three to four. Payments, telco, ad-bidding target four to five. Air-traffic-control / pacemaker-equivalent target six — but those aren’t normal interview territory.
The math of dependent and parallel components
Two services in series (one calls the other): A_total = A_1 × A_2. Two services in parallel (request succeeds if either works): A_total = 1 - (1-A_1) × (1-A_2). Two 99% services in series ≈ 98%. Two 99% services in parallel ≈ 99.99%. This is why redundancy works and why long dependency chains are dangerous.
When this is asked in interviews#
In step 1 (NFR clarification), as a target. In step 7 (evaluation), as a critique. The strongest candidates volunteer a target — “let’s aim for 99.95% which is ~4.4 hours of downtime / year” — and tie it to specific design choices later.
Most aggressive at infrastructure / SRE-track interviews (Cloudflare, Fastly, AWS, GCP, Stripe, Coinbase), at on-call-heavy roles, and at any FAANG L5+ loop. Less aggressive at frontend or early-product loops.
Common follow-ups:
- “Where does the next nine come from in your design?” — should point to the bottleneck dependency or the manual step.
- “What’s the MTTR of your failover?” — should be a specific number, defended by what monitoring + automation backs it.
- “If your dependency is 99.9% available, can your service ever exceed 99.9%?” — only if the dependency call can be optionally skipped (cache, graceful degradation, async fallback).
- “Walk me through a deploy. Does the availability target survive it?” — tests whether the candidate knows zero-downtime deploy patterns or is quietly assuming maintenance windows.
Related concepts#