Scalability

Vertical, horizontal, and elastic scaling. Why naive scaling stalls and which axis to pick first.

Concept Foundational
5 min read
scalability nfr capacity-planning

Summary#

Scalability is the property of handling more load — more users, more data, more requests — without rewriting the system. It comes in three shapes: vertical (bigger machines), horizontal (more machines), and elastic (machines added and removed on demand). The interesting question is rarely “can it scale” but “what breaks first at 10× load, and can we fix that without redesigning?”

Why it matters#

Most interview prompts hand you a current scale and ask you to design for 10× or 100×. The graded skill is identifying which component becomes the bottleneck first, not designing a system that’s already at 100× scale. Over-design now and you’re spending budget on capacity you don’t need; under-design and you’ll be rewriting at the worst possible moment.

The senior signal is comfort with trajectory. “We’re at 10k QPS today; at 100k QPS the database is the bottleneck and we’d shard on user_id; at 1M QPS we’d add a write-through cache and consider region-pinning users.” That’s a scaling story; “we’ll add more servers” is not.

How it works#

Vertical scaling (scale up)#

Bigger CPU, more RAM, faster disk, faster NIC on the same node. Pros: no code changes, no distributed-systems problems, ACID transactions still work. Cons: a hard ceiling set by the largest machine you can buy or rent, roughly a few terabytes of RAM, a couple million IOPS, and on the order of a hundred cores in the commodity range. Cost grows super-linearly past that range. Use for: anything single-leader, anything with a hard cross-row consistency requirement, the first 6 months of a startup.

Horizontal scaling (scale out)#

More nodes, same size. Pros: practically unbounded ceiling, commodity hardware, fits cloud pricing. Cons: introduces distributed-systems problems on day one: partitioning, replication, consistency, failure handling, and deployment fan-out. Use for: stateless services trivially; stateful services only once they have a partitioning story.

Elastic scaling#

Auto-add and auto-remove nodes based on load. Pros: pay for what you use; absorbs traffic bursts. Cons: warm-up time (cold start, cache warming, connection pool fill) can be longer than the burst it’s trying to absorb. Use for: stateless tiers with bursty load; ML inference; cron-driven batch jobs.
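To make the warm-up caveat concrete, here is a minimal sketch that estimates what fraction of a burst newly launched nodes can actually serve. The function name and all timing figures (detection delay, warm-up time, burst length) are illustrative assumptions, not measurements from any particular autoscaler.

```python
def burst_coverage(burst_s: float, detect_s: float, warmup_s: float) -> float:
    """Fraction of a burst during which newly added nodes are serving traffic.

    burst_s:  how long the traffic burst lasts
    detect_s: delay before the autoscaler reacts (metric window, cooldown)
    warmup_s: cold start + cache warming + connection pool fill
    """
    if burst_s <= 0:
        raise ValueError("burst_s must be positive")
    ready_at = detect_s + warmup_s          # when new capacity becomes useful
    useful = max(0.0, burst_s - ready_at)   # burst time remaining after that
    return useful / burst_s


# Hypothetical numbers: 60 s to detect the burst, 180 s to warm up a node.
short_burst = burst_coverage(burst_s=120, detect_s=60, warmup_s=180)
long_burst = burst_coverage(burst_s=1200, detect_s=60, warmup_s=180)
print(f"2-minute burst covered:  {short_burst:.0%}")
print(f"20-minute burst covered: {long_burst:.0%}")
```

Under these assumed timings the new nodes arrive after the short burst has already ended, which is why bursty tiers often keep some headroom rather than relying on elasticity alone.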

The axes that scale differently#

Every component has a different bottleneck axis. Naming them is the senior-signal move:

  • Stateless app tier. CPU-bound. Scales horizontally cleanly. Add nodes behind a load balancer; capacity is roughly linear in node count.
  • Read-heavy database. IOPS-bound and CPU-bound. Scales horizontally via read replicas; bounded by replication lag and replica fan-out.
  • Write-heavy database. Disk-bandwidth-bound and lock-contention-bound. Hard to scale horizontally without sharding. Picking the partition key is the whole problem.
  • Cache. Memory-bound. Scales horizontally via consistent hashing. Bounded by hot-key distribution: one viral key lands on one shard, and overloading that shard can drag the whole tier down with it.
  • Queue / log. Network-bound and disk-bandwidth-bound. Scales horizontally via partitions. Bounded by single-partition throughput (often quoted as roughly 10 MB/sec for Kafka, though the real figure depends on hardware and configuration).
  • State at the edge (sessions, sticky data). Memory-bound. Scales by sharding the routing layer.
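The cache bullet above can be sketched in code. The following is a minimal consistent-hash ring with virtual nodes, written for illustration only: the class name `HashRing`, the node names, and the 50% hot-key traffic mix are all assumptions, and production clients usually add replication and weighting on top.

```python
import bisect
import hashlib
from collections import Counter


def stable_hash(value: str) -> int:
    """64-bit hash that is stable across processes (built-in hash() is salted)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each node owns `vnodes` points on the ring."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def lookup(self, key: str) -> str:
        # Owner is the first ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (stable_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.lookup(k) for k in keys}

ring.add("cache-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"keys remapped after adding a node: {len(moved) / len(keys):.0%}")

# Hot key: half of all requests hit one key, so one node takes at least half the load.
traffic = ["viral-post"] * 5_000 + keys[:5_000]
load = Counter(ring.lookup(k) for k in traffic)
print(load.most_common())
```

Adding a node remaps only the keys that the new node takes over, which is the property that makes the scheme attractive for caches. The second half shows the limit: no amount of hashing spreads a single hot key, so the usual mitigations are replicating or locally caching that key.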

Variants and trade-offs#

Scale up first, scale out later: start on a single big database and defer distributed-systems complexity until forced. Works surprisingly far. Risk: the “out” migration is painful when it comes, because schema, query patterns, and transactions all need redesign.

Scale out from day one: design for sharding, multi-region deployment, and eventual consistency from the start. Pays the distributed-systems tax immediately. Justified only if you know you’ll hit scale fast; otherwise it’s over-engineering the first 18 months of product.

The three laws to keep in mind:

  • Amdahl’s law. Speedup from parallelism is bounded by the serial fraction. A workload that’s 10% serial has a max speedup of 10× regardless of cores. In practice: one shared lock, one single-writer queue, or one critical path that serializes requests sets your horizontal scale ceiling.
  • Universal scalability law (Gunther). Real-world throughput plateaus and then declines as more nodes contend for coordination. Past some point, adding capacity makes things worse. The cure is to remove coordination, not add nodes.
  • Little’s law. Concurrency = arrival rate × latency. Want to handle 10k QPS with 100 ms latency? You need 1,000 concurrent in-flight requests. If your thread pool or connection pool is 200, requests queue up and the system tips over.
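The three laws are easy to sanity-check numerically. The sketch below encodes each as a small function; the function names are made up for this example, and the contention and coherency coefficients passed to the universal scalability law are illustrative values rather than measurements of any real system.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def usl_relative_capacity(nodes: int, contention: float, coherency: float) -> float:
    """Gunther's USL: C(N) = N / (1 + a(N - 1) + b N (N - 1)).

    contention (a) models queueing on shared resources;
    coherency (b) models the cost of nodes keeping each other in sync.
    """
    n = nodes
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))


def littles_law_concurrency(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: in-flight requests = arrival rate x latency."""
    return arrival_rate_per_s * latency_s


# Amdahl: 10% serial work approaches, but never reaches, a 10x speedup.
for workers in (10, 100, 1000):
    print(f"{workers:>5} workers -> {amdahl_speedup(0.10, workers):.2f}x")

# USL with assumed coefficients: capacity peaks and then declines.
for nodes in (10, 31, 100):
    print(f"{nodes:>5} nodes   -> {usl_relative_capacity(nodes, 0.05, 0.001):.2f}x one node")

# Little: 10k QPS at 100 ms needs far more slots than a 200-connection pool.
needed = littles_law_concurrency(10_000, 0.100)
print(f"in-flight requests needed: {needed:.0f}")
```

With the assumed coefficients, capacity at 100 nodes is lower than at 31, which is the "adding capacity makes things worse" regime described above.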

Sharding strategies vary by use case:

  • Range partitioning (key sorted into ranges). Good for range scans; bad for hot keys at the end of the range (time-series tail).
  • Hash partitioning (key hashed to shard). Great for even distribution; awful for range scans.
  • Geographic partitioning (user pinned to region). Best latency, hardest for cross-region operations.
  • Functional partitioning (different services own different tables). Earliest scale move; clean until you need cross-service joins.
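The range-versus-hash trade-off can be shown with a small experiment. This is a sketch under assumptions: the shard count, the quarterly range boundaries, and the timestamp key format are all invented for illustration.

```python
import bisect
import hashlib
from collections import Counter
from typing import Sequence


def hash_shard(key: str, num_shards: int) -> int:
    """Hash partitioning: even spread, but neighbouring keys are scattered."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(key: str, upper_bounds: Sequence[str]) -> int:
    """Range partitioning: upper_bounds[i] is the exclusive upper bound of
    shard i; the last shard takes every key at or above the final bound."""
    return bisect.bisect_right(upper_bounds, key)


# Hypothetical time-series keys: 1,000 writes that all arrive "now".
recent_writes = [f"2024-12-31T00:{i // 60:02d}:{i % 60:02d}" for i in range(1000)]

# Four range shards split by quarter (assumed boundaries).
bounds = ["2024-04-01", "2024-07-01", "2024-10-01"]

range_load = Counter(range_shard(k, bounds) for k in recent_writes)
hash_load = Counter(hash_shard(k, 4) for k in recent_writes)

print("range:", dict(range_load))  # every write lands on the last shard
print("hash: ", dict(hash_load))   # writes spread across all shards
```

The range layout keeps a day's data contiguous, so a scan over a time window touches one shard, but the newest shard absorbs all writes. The hash layout balances writes and forces the same scan to fan out to every shard.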

Why ‘just add more servers’ doesn’t scale linearly

Three forces fight you: (1) coordination overhead grows with node count — heartbeats, gossip, locks, consensus quorums — eating CPU you wanted for work; (2) shared resources (load balancer, DNS, observability backend) saturate; (3) tail latency grows because requests touching N shards wait for the slowest one. The combined effect is the difference between throughput in theory and throughput on Tuesday.
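The tail-latency force has a simple back-of-envelope form. As a sketch, assume each shard call is independently slow with some small probability (1% here, an arbitrary figure); real shards are rarely fully independent, so treat the numbers as directional.

```python
def p_slow_fanout(p_slow_single: float, shards: int) -> float:
    """Probability that at least one of `shards` independent calls is slow.

    A request that waits on every shard is as slow as its slowest call,
    so this is also the chance the whole request lands in the tail.
    """
    return 1.0 - (1.0 - p_slow_single) ** shards


# Assume each shard exceeds its latency target 1% of the time.
for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {p_slow_fanout(0.01, n):.1%} of requests hit a slow shard")
```

A per-shard p99 that looks healthy in isolation becomes the typical experience once a request touches on the order of a hundred shards.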

When this is asked in interviews#

Always, but the form matters. The question is rarely “is this scalable” (yes, you’ll say); it’s “what breaks at 10× / 100×” or “you’re the on-call when traffic doubles overnight — walk me through what you do”.

The strongest candidates volunteer scale targets in step 1 (“let’s design for 10M DAU and growth to 100M”), do the back-of-envelope math in step 2, and return in step 7 to enumerate failure points at the higher scale.
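The back-of-envelope step might look like the following sketch. The DAU figures come from the example above; the requests-per-user and peak-to-average numbers are assumptions you would state out loud and adjust to the prompt.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau: int, requests_per_user_per_day: float) -> float:
    """Average request rate if traffic were spread evenly across the day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float) -> float:
    """Peak rate, using an assumed peak-to-average multiplier."""
    return avg_qps * peak_factor


REQUESTS_PER_USER = 20   # assumption: requests per active user per day
PEAK_FACTOR = 3.0        # assumption: peak traffic is 3x the daily average

today_avg = average_qps(10_000_000, REQUESTS_PER_USER)
target_avg = average_qps(100_000_000, REQUESTS_PER_USER)

print(f"10M DAU:  avg {today_avg:,.0f} QPS, peak {peak_qps(today_avg, PEAK_FACTOR):,.0f} QPS")
print(f"100M DAU: avg {target_avg:,.0f} QPS, peak {peak_qps(target_avg, PEAK_FACTOR):,.0f} QPS")
```

The exact values matter less than the order of magnitude, which tells you whether a single database, read replicas, or sharding is the plausible starting point.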

Expect the most aggressive scaling questions at high-traffic consumer companies (Meta, Google, Netflix, ByteDance, Snap), at infra-heavy startups (Stripe, Cloudflare, Datadog, Snowflake), and in any “scale” team interview. Expect less pressure in internal-tool or low-traffic B2B roles.

Common follow-ups:

  • “Pick the first three components to fail at 10×. What’s your fix for each?”
  • “How does your partition key behave when 80% of writes are to 5% of keys?”
  • “Walk me through your scaling story from where it is today to 10×.”
  • “Where does adding more servers stop helping?”