Databases
Relational vs document vs wide-column vs graph: when each shape fits, and the trade-off triangle.
Use cases#
“Database” covers radically different storage shapes. The right one depends on the access pattern:
- Relational (Postgres, MySQL) — strong consistency, joins across normalized data, transactions. Default for anything financial, account-centric, or schema-evolving.
- Document (MongoDB, Couchbase) — JSON-shaped aggregates with flexible schema. Default for content management, user profiles, catalog entries that don’t need cross-row joins.
- Wide-column (Cassandra, HBase, ScyllaDB) — partition-key-driven, write-optimized, eventually consistent. Default for time-series, event logs, IoT, anything write-heavy with predictable read patterns.
- Key-value (Redis, DynamoDB, RocksDB) — single-key lookups, sub-millisecond latency. Default for sessions, caches, feature flags. Covered in detail in Key-Value Store.
- Graph (Neo4j, JanusGraph) — multi-hop traversal as a first-class operation. Default for social graphs, fraud rings, knowledge graphs.
Most real systems use 3-5 of these in combination — the polyglot persistence pattern.
Functional requirements#
- Durable storage of structured data with a defined query language (SQL or a per-database DSL).
- Indexing for predictable lookups beyond the primary key.
- Transactions or at least atomic single-key operations.
- Backup, restore, and point-in-time recovery.
- Replication for read scale-out and failover.
Non-functional requirements#
- Latency: OLTP reads p99
< 10 ms; writes p99< 50 ms. Sub-millisecond for in-memory stores. - Durability: writes must survive disk failure, machine failure, and (with multi-region replication) region failure. Quantified as nines of durability — 11 nines (
99.999999999%) is the AWS S3 marketing number. - Throughput: a single Postgres instance comfortably handles 10-50k QPS of mixed workload; sharded Cassandra clusters push past 1M writes/sec.
- Consistency: pick a target on the linearizable → eventually-consistent axis. Most relational databases default to read-committed, not serializable.
High-level design#
┌──────────── primary ────────────┐clients ──> │ query parser → planner → exec │ │ │ │ │ buffer pool ──> storage │ (WAL, datafiles) └──────────────│──────────────────┘ │ ┌────────┴────────┐ │ │ replica 1 replica 2 (read-only, async)A typical OLTP database: a single primary accepts writes, streams its WAL (write-ahead log) to replicas. Replicas serve reads. Failover promotes a replica when the primary dies. Sharded databases multiply this — one primary per shard, often with each shard replicated 3-way.
Detailed design#
Picking the shape#
| Workload | Shape | Why |
|---|---|---|
| Bank ledger, ecommerce checkout | Relational | Transactions, joins, schema discipline |
| Product catalog, CMS pages, user profiles | Document | Flexible nested data, no cross-doc joins |
| Time-series, IoT, click logs | Wide-column | Write-optimized, partition-key-driven |
| Session store, feature flags, leaderboard | Key-value | Sub-ms latency, single-key access |
| Friend-of-friend, fraud rings, recommendations | Graph | Multi-hop traversal in one query |
Storage engines#
B-tree (Postgres, MySQL InnoDB): balanced on-disk tree, optimized for reads. Updates rewrite pages in place. Random I/O on writes, fast lookups.
LSM-tree (Cassandra, RocksDB, LevelDB): writes go to a memtable, flushed sequentially to immutable SSTables, compacted in background. Sequential I/O on writes (great for SSDs and write-heavy loads), reads must check multiple SSTables (bloom filters help).
Indexes#
Primary index: the table itself, sorted by primary key. Secondary indexes: separate B-tree (or LSM) keyed by some other column, pointing back to the primary row.
The cost of indexes is write amplification: every insert updates the primary + every secondary index. The classic foot-gun is over-indexing an OLTP table — write latency triples.
Composite indexes only help queries whose WHERE clauses use the leading columns. An index on (user_id, created_at) answers WHERE user_id = ? AND created_at > ?. The same index does not answer WHERE created_at > ? alone.
Transactions and isolation#
The four ANSI isolation levels (from weakest to strongest): read-uncommitted, read-committed, repeatable-read, serializable. Most databases default to read-committed; Postgres treats repeatable-read as snapshot isolation. True serializability is expensive — usually implemented via SSI (Serializable Snapshot Isolation) or 2PL.
Cross-shard transactions are dramatically more expensive than single-shard. Two-phase commit (2PC) is the textbook answer; modern systems (Spanner, CockroachDB) use TrueTime or hybrid logical clocks to avoid 2PC’s coordination overhead.
Replication#
synchronous — primary waits for replica ack. Zero data loss, latency = max(primary, slowest replica).asynchronous — primary acks before replica catches up. Lower latency, data loss window = replication lag.quorum — primary waits for K of N replicas. Tunable per-write (DynamoDB's W, Cassandra's CL).Sharding#
When a single primary can’t keep up:
- Hash sharding —
hash(key) mod N. Uniform distribution, hard to scan ranges, hard to resize. Use consistent hashing to soften resize. - Range sharding — sort by key, slice into ranges. Range scans are cheap; hot ranges are the failure mode (auto-incrementing IDs land all writes on the last shard).
- Directory sharding — a lookup service maps each key to a shard. Most flexible, requires an extra hop.
Trade-offs#
Other tensions:
- Normalize vs denormalize — relational defaults to normalize; document and wide-column default to denormalize. Denormalizing speeds reads but multiplies writes and risks drift.
- Strong consistency vs availability — CAP says you pick during a partition; PACELC says you also pick the latency / consistency trade-off in the absence of partitions.
- Operational maturity — Postgres has 30 years of tooling, drivers, and operator knowledge. A newer database may fit your workload but cost you weeks the first time it misbehaves at 3 AM.
Real-world examples#
- PostgreSQL powers Instagram (sharded with Citus), Reddit, GitLab, and most YC-stage startups. The Swiss Army knife.
- MySQL at Facebook handles trillions of rows across thousands of shards via the MyRocks (LSM) engine.
- Cassandra runs Discord’s message store: hundreds of billions of messages across 177 nodes, partitioned by
(channel_id, bucket). - DynamoDB is Amazon’s bet on hash-partitioned KV. Serverless, single-digit-ms p99, used by Lyft, Snap, and Airbnb for hot paths.
- MongoDB is the default document store; Atlas Search adds inverted-index search alongside.
- Neo4j powers LinkedIn’s people-you-may-know and a chunk of NASA’s incident knowledge graph.
- Spanner (Google) is the production proof that you can have horizontally-scaled SQL with strong consistency — via TrueTime atomic-clock synchronization.
Related building blocks#
- Key-Value Store — the simplest database shape, often the highest-QPS one.
- Distributed Cache — frequently fronts a slower OLTP database.
- Blob Store — where large payloads live with the database holding only a reference.
- Distributed Search — the inverted-index cousin used alongside primary stores.