Scalability
Vertical, horizontal, and elasticity. Why naive scaling stalls and which axis to pick first.
Summary#
Scalability is the property of handling more load — more users, more data, more requests — without rewriting the system. It comes in three shapes: vertical (bigger machines), horizontal (more machines), and elastic (machines added and removed on demand). The interesting question is rarely “can it scale” but “what breaks first at 10× load, and can we fix that without redesigning?”
Why it matters#
Most interview prompts hand you a current scale and ask you to design for 10× or 100×. The graded skill is identifying which component becomes the bottleneck first, not designing a system that’s already at 100× scale. Over-design now and you’re spending budget on capacity you don’t need; under-design and you’ll be rewriting at the worst possible moment.
The senior signal is comfort with trajectory. “We’re at 10k QPS today; at 100k QPS the database is the bottleneck and we’d shard on user_id; at 1M QPS we’d add a write-through cache and consider region-pinning users.” That’s a scaling story; “we’ll add more servers” is not.
How it works#
Vertical scaling (scale up)#
Bigger CPU, more RAM, faster disk, faster NIC on the same node. Pros: no code changes, no distributed-systems problems, ACID transactions still work. Cons: hard ceiling around a few terabytes of RAM, a couple million IOPS, ~100 cores. Cost grows super-linearly past the commodity range. Use for: anything single-leader, anything with a hard cross-row consistency requirement, the first 6 months of a startup.
Horizontal scaling (scale out)#
More nodes, same size. Pros: practically unbounded ceiling, commodity hardware, fits cloud pricing. Cons: introduces distributed-systems problems on day one — partitioning, replication, consistency, failure handling, deployment fanning out. Use for: stateless services trivially; stateful services with a partitioning story.
Elastic scaling#
Auto-add and auto-remove nodes based on load. Pros: pay for what you use; absorbs traffic bursts. Cons: warm-up time (cold start, cache warm, connection pool fill) can be longer than the burst it’s trying to absorb. Use for: stateless tier with bursty load; ML inference; cron-driven batch.
The axes that scale differently#
Every component has a different bottleneck axis. Naming them is the senior-signal move:
- Stateless app tier. CPU-bound. Scales horizontally cleanly. Add nodes behind a load balancer; capacity is roughly linear in node count.
- Read-heavy database. IOPS-bound and CPU-bound. Scales horizontally via read replicas; bounded by replication lag and replica fan-out.
- Write-heavy database. Disk-bandwidth-bound and lock-contention-bound. Hard to scale horizontally without sharding. Picking the partition key is the whole problem.
- Cache. Memory-bound. Scales horizontally via consistent hashing. Bounded by hot-key distribution — one viral key on one shard kills the whole tier.
- Queue / log. Network-bound and disk-bandwidth-bound. Scales horizontally via partitions. Bounded by single-partition throughput (~10 MB/sec for Kafka).
- State at the edge (sessions, sticky data). Memory-bound. Scales by sharding the routing layer.
Variants and trade-offs#
The three laws to keep in mind:
- Amdahl’s law. Speedup from parallelism is bounded by the serial fraction. A workload that’s 10% serial has a max speedup of 10× regardless of cores. Translates to: one shared lock, one single-writer queue, one critical path serializing requests — and your horizontal scale ceiling is set.
- Universal scalability law (Gunther). Real-world throughput plateaus and then declines as more nodes contend for coordination. Past some point, adding capacity makes things worse. The cure is to remove coordination, not add nodes.
- Little’s law. Concurrency = arrival rate × latency. Want to handle 10k QPS with 100 ms latency? You need 1000 concurrent in-flight requests. If your thread pool / connection pool is 200, you’ll queue and tip.
Sharding strategies vary by use case:
- Range partitioning (key sorted into ranges). Good for range scans; bad for hot keys at the end of the range (time-series tail).
- Hash partitioning (key hashed to shard). Great for even distribution; awful for range scans.
- Geographic partitioning (user pinned to region). Best latency, hardest for cross-region operations.
- Functional partitioning (different services own different tables). Earliest scale move; clean until you need cross-service joins.
Why 'just add more servers' doesn't scale linearly
Three forces fight you: (1) coordination overhead grows with node count — heartbeats, gossip, locks, consensus quorums — eating CPU you wanted for work; (2) shared resources (load balancer, DNS, observability backend) saturate; (3) tail latency grows because requests touching N shards wait for the slowest one. The combined effect is the difference between throughput in theory and throughput on Tuesday.
When this is asked in interviews#
Always, but the form matters. The question is rarely “is this scalable” (yes, you’ll say); it’s “what breaks at 10× / 100×” or “you’re the on-call when traffic doubles overnight — walk me through what you do”.
The strongest candidates volunteer scale targets in step 1 (“let’s design for 10M DAU and growth to 100M”), do the back-of-envelope math in step 2, and return in step 7 to enumerate failure points at the higher scale.
Most aggressive at any high-traffic consumer company (Meta, Google, Netflix, ByteDance, Snap), at infra-heavy startups (Stripe, Cloudflare, Datadog, Snowflake), and at any “scale” team interview. Less aggressive at internal-tool / B2B-low-traffic roles.
Common follow-ups:
- “Pick the first three components to fail at 10×. What’s your fix for each?”
- “How does your partition key behave when 80% of writes are to 5% of keys?”
- “Walk me through your scaling story from where it is today to 10×.”
- “Where does adding more servers stop helping?”
Related concepts#