Databases — System Design · Engineering Playbook

Use cases#

“Database” covers radically different storage shapes. The right one depends on the access pattern:

Relational (Postgres, MySQL) — strong consistency, joins across normalized data, transactions. Default for anything financial, account-centric, or schema-evolving.
Document (MongoDB, Couchbase) — JSON-shaped aggregates with flexible schema. Default for content management, user profiles, catalog entries that don’t need cross-row joins.
Wide-column (Cassandra, HBase, ScyllaDB) — partition-key-driven, write-optimized, eventually consistent. Default for time-series, event logs, IoT, anything write-heavy with predictable read patterns.
Key-value (Redis, DynamoDB, RocksDB) — single-key lookups, sub-millisecond latency. Default for sessions, caches, feature flags. Covered in detail in Key-Value Store.
Graph (Neo4j, JanusGraph) — multi-hop traversal as a first-class operation. Default for social graphs, fraud rings, knowledge graphs.

Most real systems use 3-5 of these in combination — the polyglot persistence pattern.

Functional requirements#

Durable storage of structured data with a defined query language (SQL or a per-database DSL).
Indexing for predictable lookups beyond the primary key.
Transactions or at least atomic single-key operations.
Backup, restore, and point-in-time recovery.
Replication for read scale-out and failover.

Non-functional requirements#

Latency: OLTP reads p99 < 10 ms; writes p99 < 50 ms. Sub-millisecond for in-memory stores.
Durability: writes must survive disk failure, machine failure, and (with multi-region replication) region failure. Quantified as nines of durability — 11 nines (99.999999999%) is the AWS S3 marketing number.
Throughput: a single Postgres instance comfortably handles 10-50k QPS of mixed workload; sharded Cassandra clusters push past 1M writes/sec.
Consistency: pick a target on the linearizable → eventually-consistent axis. Most relational databases default to read-committed, not serializable.

High-level design#

            ┌──────────── primary ────────────┐
clients ──> │  query parser → planner → exec   │
            │             │                    │
            │   buffer pool  ──>  storage      │  (WAL, datafiles)
            └──────────────│──────────────────┘
                           │
                  ┌────────┴────────┐
                  │                  │
              replica 1          replica 2  (read-only, async)

A typical OLTP database: a single primary accepts writes, streams its WAL (write-ahead log) to replicas. Replicas serve reads. Failover promotes a replica when the primary dies. Sharded databases multiply this — one primary per shard, often with each shard replicated 3-way.

Detailed design#

Picking the shape#

Workload	Shape	Why
Bank ledger, ecommerce checkout	Relational	Transactions, joins, schema discipline
Product catalog, CMS pages, user profiles	Document	Flexible nested data, no cross-doc joins
Time-series, IoT, click logs	Wide-column	Write-optimized, partition-key-driven
Session store, feature flags, leaderboard	Key-value	Sub-ms latency, single-key access
Friend-of-friend, fraud rings, recommendations	Graph	Multi-hop traversal in one query

Storage engines#

B-tree (Postgres, MySQL InnoDB): balanced on-disk tree, optimized for reads. Updates rewrite pages in place. Random I/O on writes, fast lookups.

LSM-tree (Cassandra, RocksDB, LevelDB): writes go to a memtable, flushed sequentially to immutable SSTables, compacted in background. Sequential I/O on writes (great for SSDs and write-heavy loads), reads must check multiple SSTables (bloom filters help).

Indexes#

Primary index: the table itself, sorted by primary key. Secondary indexes: separate B-tree (or LSM) keyed by some other column, pointing back to the primary row.

The cost of indexes is write amplification: every insert updates the primary + every secondary index. The classic foot-gun is over-indexing an OLTP table — write latency triples.

Composite indexes only help queries whose WHERE clauses use the leading columns. An index on (user_id, created_at) answers WHERE user_id = ? AND created_at > ?. The same index does not answer WHERE created_at > ? alone.

Transactions and isolation#

The four ANSI isolation levels (from weakest to strongest): read-uncommitted, read-committed, repeatable-read, serializable. Most databases default to read-committed; Postgres treats repeatable-read as snapshot isolation. True serializability is expensive — usually implemented via SSI (Serializable Snapshot Isolation) or 2PL.

Cross-shard transactions are dramatically more expensive than single-shard. Two-phase commit (2PC) is the textbook answer; modern systems (Spanner, CockroachDB) use TrueTime or hybrid logical clocks to avoid 2PC’s coordination overhead.

Replication#

synchronous     — primary waits for replica ack. Zero data loss, latency = max(primary, slowest replica).
asynchronous    — primary acks before replica catches up. Lower latency, data loss window = replication lag.
quorum          — primary waits for K of N replicas. Tunable per-write (DynamoDB's W, Cassandra's CL).

Sharding#

When a single primary can’t keep up:

Hash sharding — hash(key) mod N. Uniform distribution, hard to scan ranges, hard to resize. Use consistent hashing to soften resize.
Range sharding — sort by key, slice into ranges. Range scans are cheap; hot ranges are the failure mode (auto-incrementing IDs land all writes on the last shard).
Directory sharding — a lookup service maps each key to a shard. Most flexible, requires an extra hop.

Trade-offs#

SQL / relational — schema discipline catches bugs at migration time; joins keep app code simple; ACID gives transactional safety. Scaling writes past one primary is non-trivial (Vitess, Citus, partitioning).

NoSQL — horizontally scales writes by partitioning out of the box; flexible schema speeds early iteration. Application now owns cross-collection consistency, denormalization, and many concerns the database used to handle.

Other tensions:

Normalize vs denormalize — relational defaults to normalize; document and wide-column default to denormalize. Denormalizing speeds reads but multiplies writes and risks drift.
Strong consistency vs availability — CAP says you pick during a partition; PACELC says you also pick the latency / consistency trade-off in the absence of partitions.
Operational maturity — Postgres has 30 years of tooling, drivers, and operator knowledge. A newer database may fit your workload but cost you weeks the first time it misbehaves at 3 AM.

Real-world examples#

PostgreSQL powers Instagram (sharded with Citus), Reddit, GitLab, and most YC-stage startups. The Swiss Army knife.
MySQL at Facebook handles trillions of rows across thousands of shards via the MyRocks (LSM) engine.
Cassandra runs Discord’s message store: hundreds of billions of messages across 177 nodes, partitioned by (channel_id, bucket).
DynamoDB is Amazon’s bet on hash-partitioned KV. Serverless, single-digit-ms p99, used by Lyft, Snap, and Airbnb for hot paths.
MongoDB is the default document store; Atlas Search adds inverted-index search alongside.
Neo4j powers LinkedIn’s people-you-may-know and a chunk of NASA’s incident knowledge graph.
Spanner (Google) is the production proof that you can have horizontally-scaled SQL with strong consistency — via TrueTime atomic-clock synchronization.

Key-Value Store — the simplest database shape, often the highest-QPS one.
Distributed Cache — frequently fronts a slower OLTP database.
Blob Store — where large payloads live with the database holding only a reference.
Distributed Search — the inverted-index cousin used alongside primary stores.