Scalability — System Design · Engineering Playbook

Summary#

Scalability is the property of handling more load — more users, more data, more requests — without rewriting the system. It comes in three shapes: vertical (bigger machines), horizontal (more machines), and elastic (machines added and removed on demand). The interesting question is rarely “can it scale” but “what breaks first at 10× load, and can we fix that without redesigning?”

Why it matters#

Most interview prompts hand you a current scale and ask you to design for 10× or 100×. The graded skill is identifying which component becomes the bottleneck first, not designing a system that’s already at 100× scale. Over-design now and you’re spending budget on capacity you don’t need; under-design and you’ll be rewriting at the worst possible moment.

The senior signal is comfort with trajectory. “We’re at 10k QPS today; at 100k QPS the database is the bottleneck and we’d shard on user_id; at 1M QPS we’d add a write-through cache and consider region-pinning users.” That’s a scaling story; “we’ll add more servers” is not.

How it works#

Vertical scaling (scale up)#

Bigger CPU, more RAM, faster disk, faster NIC on the same node. Pros: no code changes, no distributed-systems problems, ACID transactions still work. Cons: hard ceiling around a few terabytes of RAM, a couple million IOPS, ~100 cores. Cost grows super-linearly past the commodity range. Use for: anything single-leader, anything with a hard cross-row consistency requirement, the first 6 months of a startup.

Horizontal scaling (scale out)#

More nodes, same size. Pros: practically unbounded ceiling, commodity hardware, fits cloud pricing. Cons: introduces distributed-systems problems on day one — partitioning, replication, consistency, failure handling, deployment fanning out. Use for: stateless services trivially; stateful services with a partitioning story.

Elastic scaling#

Auto-add and auto-remove nodes based on load. Pros: pay for what you use; absorbs traffic bursts. Cons: warm-up time (cold start, cache warm, connection pool fill) can be longer than the burst it’s trying to absorb. Use for: stateless tier with bursty load; ML inference; cron-driven batch.

The axes that scale differently#

Every component has a different bottleneck axis. Naming them is the senior-signal move:

Stateless app tier. CPU-bound. Scales horizontally cleanly. Add nodes behind a load balancer; capacity is roughly linear in node count.
Read-heavy database. IOPS-bound and CPU-bound. Scales horizontally via read replicas; bounded by replication lag and replica fan-out.
Write-heavy database. Disk-bandwidth-bound and lock-contention-bound. Hard to scale horizontally without sharding. Picking the partition key is the whole problem.
Cache. Memory-bound. Scales horizontally via consistent hashing. Bounded by hot-key distribution — one viral key on one shard kills the whole tier.
Queue / log. Network-bound and disk-bandwidth-bound. Scales horizontally via partitions. Bounded by single-partition throughput (~10 MB/sec for Kafka).
State at the edge (sessions, sticky data). Memory-bound. Scales by sharding the routing layer.

Variants and trade-offs#

Scale up first, scale out later — start on a single big database; defer distributed-systems complexity until forced. Works surprisingly far. Risk: the “out” migration is hugely painful when it comes — schema, query patterns, transactions all need redesign.

Scale out from day one — design for sharding, multi-region, eventually consistent from the start. Pays the distributed-systems tax immediately. Justified only if you know you’ll hit scale fast; otherwise it’s over-engineering the first 18 months of product.

The three laws to keep in mind:

Amdahl’s law. Speedup from parallelism is bounded by the serial fraction. A workload that’s 10% serial has a max speedup of 10× regardless of cores. Translates to: one shared lock, one single-writer queue, one critical path serializing requests — and your horizontal scale ceiling is set.
Universal scalability law (Gunther). Real-world throughput plateaus and then declines as more nodes contend for coordination. Past some point, adding capacity makes things worse. The cure is to remove coordination, not add nodes.
Little’s law. Concurrency = arrival rate × latency. Want to handle 10k QPS with 100 ms latency? You need 1000 concurrent in-flight requests. If your thread pool / connection pool is 200, you’ll queue and tip.

Sharding strategies vary by use case:

Range partitioning (key sorted into ranges). Good for range scans; bad for hot keys at the end of the range (time-series tail).
Hash partitioning (key hashed to shard). Great for even distribution; awful for range scans.
Geographic partitioning (user pinned to region). Best latency, hardest for cross-region operations.
Functional partitioning (different services own different tables). Earliest scale move; clean until you need cross-service joins.

Why 'just add more servers' doesn't scale linearly

Three forces fight you: (1) coordination overhead grows with node count — heartbeats, gossip, locks, consensus quorums — eating CPU you wanted for work; (2) shared resources (load balancer, DNS, observability backend) saturate; (3) tail latency grows because requests touching N shards wait for the slowest one. The combined effect is the difference between throughput in theory and throughput on Tuesday.

When this is asked in interviews#

Always, but the form matters. The question is rarely “is this scalable” (yes, you’ll say); it’s “what breaks at 10× / 100×” or “you’re the on-call when traffic doubles overnight — walk me through what you do”.

The strongest candidates volunteer scale targets in step 1 (“let’s design for 10M DAU and growth to 100M”), do the back-of-envelope math in step 2, and return in step 7 to enumerate failure points at the higher scale.

Most aggressive at any high-traffic consumer company (Meta, Google, Netflix, ByteDance, Snap), at infra-heavy startups (Stripe, Cloudflare, Datadog, Snowflake), and at any “scale” team interview. Less aggressive at internal-tool / B2B-low-traffic roles.

Common follow-ups:

“Pick the first three components to fail at 10×. What’s your fix for each?”
“How does your partition key behave when 80% of writes are to 5% of keys?”
“Walk me through your scaling story from where it is today to 10×.”
“Where does adding more servers stop helping?”