Databases

Relational vs document vs wide-column vs graph: when each shape fits, and the trade-off triangle.

Building Block Foundational
6 min read
databases storage data-modeling
Companies this resembles: PostgreSQL · MongoDB · Cassandra · Neo4j · DynamoDB

Use cases#

“Database” covers radically different storage shapes. The right one depends on the access pattern:

  • Relational (Postgres, MySQL) — strong consistency, joins across normalized data, transactions. Default for anything financial, account-centric, or schema-evolving.
  • Document (MongoDB, Couchbase) — JSON-shaped aggregates with flexible schema. Default for content management, user profiles, catalog entries that don’t need cross-row joins.
  • Wide-column (Cassandra, HBase, ScyllaDB) — partition-key-driven, write-optimized, eventually consistent. Default for time-series, event logs, IoT, anything write-heavy with predictable read patterns.
  • Key-value (Redis, DynamoDB, RocksDB) — single-key lookups, sub-millisecond latency. Default for sessions, caches, feature flags. Covered in detail in Key-Value Store.
  • Graph (Neo4j, JanusGraph) — multi-hop traversal as a first-class operation. Default for social graphs, fraud rings, knowledge graphs.

Most real systems use 3-5 of these in combination — the polyglot persistence pattern.

Functional requirements#

  • Durable storage of structured data with a defined query language (SQL or a per-database DSL).
  • Indexing for predictable lookups beyond the primary key.
  • Transactions or at least atomic single-key operations.
  • Backup, restore, and point-in-time recovery.
  • Replication for read scale-out and failover.

Non-functional requirements#

  • Latency: OLTP reads p99 < 10 ms; writes p99 < 50 ms. Sub-millisecond for in-memory stores.
  • Durability: writes must survive disk failure, machine failure, and (with multi-region replication) region failure. Quantified as nines of durability — 11 nines (99.999999999%) is the AWS S3 marketing number.
  • Throughput: a single Postgres instance comfortably handles 10-50k QPS of mixed workload; sharded Cassandra clusters push past 1M writes/sec.
  • Consistency: pick a target on the linearizable → eventually-consistent axis. Most relational databases default to read-committed, not serializable.

High-level design#

┌──────────── primary ────────────┐
clients ──> │ query parser → planner → exec │
│ │ │
│ buffer pool ──> storage │ (WAL, datafiles)
└──────────────│──────────────────┘
┌────────┴────────┐
│ │
replica 1 replica 2 (read-only, async)

A typical OLTP database: a single primary accepts writes, streams its WAL (write-ahead log) to replicas. Replicas serve reads. Failover promotes a replica when the primary dies. Sharded databases multiply this — one primary per shard, often with each shard replicated 3-way.

Detailed design#

Picking the shape#

WorkloadShapeWhy
Bank ledger, ecommerce checkoutRelationalTransactions, joins, schema discipline
Product catalog, CMS pages, user profilesDocumentFlexible nested data, no cross-doc joins
Time-series, IoT, click logsWide-columnWrite-optimized, partition-key-driven
Session store, feature flags, leaderboardKey-valueSub-ms latency, single-key access
Friend-of-friend, fraud rings, recommendationsGraphMulti-hop traversal in one query

Storage engines#

B-tree (Postgres, MySQL InnoDB): balanced on-disk tree, optimized for reads. Updates rewrite pages in place. Random I/O on writes, fast lookups.

LSM-tree (Cassandra, RocksDB, LevelDB): writes go to a memtable, flushed sequentially to immutable SSTables, compacted in background. Sequential I/O on writes (great for SSDs and write-heavy loads), reads must check multiple SSTables (bloom filters help).

Indexes#

Primary index: the table itself, sorted by primary key. Secondary indexes: separate B-tree (or LSM) keyed by some other column, pointing back to the primary row.

The cost of indexes is write amplification: every insert updates the primary + every secondary index. The classic foot-gun is over-indexing an OLTP table — write latency triples.

Composite indexes only help queries whose WHERE clauses use the leading columns. An index on (user_id, created_at) answers WHERE user_id = ? AND created_at > ?. The same index does not answer WHERE created_at > ? alone.

Transactions and isolation#

The four ANSI isolation levels (from weakest to strongest): read-uncommitted, read-committed, repeatable-read, serializable. Most databases default to read-committed; Postgres treats repeatable-read as snapshot isolation. True serializability is expensive — usually implemented via SSI (Serializable Snapshot Isolation) or 2PL.

Cross-shard transactions are dramatically more expensive than single-shard. Two-phase commit (2PC) is the textbook answer; modern systems (Spanner, CockroachDB) use TrueTime or hybrid logical clocks to avoid 2PC’s coordination overhead.

Replication#

synchronous — primary waits for replica ack. Zero data loss, latency = max(primary, slowest replica).
asynchronous — primary acks before replica catches up. Lower latency, data loss window = replication lag.
quorum — primary waits for K of N replicas. Tunable per-write (DynamoDB's W, Cassandra's CL).

Sharding#

When a single primary can’t keep up:

  • Hash shardinghash(key) mod N. Uniform distribution, hard to scan ranges, hard to resize. Use consistent hashing to soften resize.
  • Range sharding — sort by key, slice into ranges. Range scans are cheap; hot ranges are the failure mode (auto-incrementing IDs land all writes on the last shard).
  • Directory sharding — a lookup service maps each key to a shard. Most flexible, requires an extra hop.

Trade-offs#

SQL / relational — schema discipline catches bugs at migration time; joins keep app code simple; ACID gives transactional safety. Scaling writes past one primary is non-trivial (Vitess, Citus, partitioning).
NoSQL — horizontally scales writes by partitioning out of the box; flexible schema speeds early iteration. Application now owns cross-collection consistency, denormalization, and many concerns the database used to handle.

Other tensions:

  • Normalize vs denormalize — relational defaults to normalize; document and wide-column default to denormalize. Denormalizing speeds reads but multiplies writes and risks drift.
  • Strong consistency vs availability — CAP says you pick during a partition; PACELC says you also pick the latency / consistency trade-off in the absence of partitions.
  • Operational maturity — Postgres has 30 years of tooling, drivers, and operator knowledge. A newer database may fit your workload but cost you weeks the first time it misbehaves at 3 AM.

Real-world examples#

  • PostgreSQL powers Instagram (sharded with Citus), Reddit, GitLab, and most YC-stage startups. The Swiss Army knife.
  • MySQL at Facebook handles trillions of rows across thousands of shards via the MyRocks (LSM) engine.
  • Cassandra runs Discord’s message store: hundreds of billions of messages across 177 nodes, partitioned by (channel_id, bucket).
  • DynamoDB is Amazon’s bet on hash-partitioned KV. Serverless, single-digit-ms p99, used by Lyft, Snap, and Airbnb for hot paths.
  • MongoDB is the default document store; Atlas Search adds inverted-index search alongside.
  • Neo4j powers LinkedIn’s people-you-may-know and a chunk of NASA’s incident knowledge graph.
  • Spanner (Google) is the production proof that you can have horizontally-scaled SQL with strong consistency — via TrueTime atomic-clock synchronization.
  • Key-Value Store — the simplest database shape, often the highest-QPS one.
  • Distributed Cache — frequently fronts a slower OLTP database.
  • Blob Store — where large payloads live with the database holding only a reference.
  • Distributed Search — the inverted-index cousin used alongside primary stores.
Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.