Key-Value Store — System Design · Engineering Playbook

Use cases#

A key-value store maps an opaque key to an opaque value. Three flavors of use case:

Session and feature-flag storage — user clicks → look up “what experiments is user 42 in?” — sub-millisecond, billions of reads/day.
Shopping cart / per-user state — Amazon’s original Dynamo paper was motivated by this. Always-on writes; merging concurrent updates is acceptable.
Application primary store — when access is fundamentally single-key (URL shortener, profile lookup), a KV store is cheaper, faster, and easier to scale than a relational database.

The defining trait: queries are always “get the value for this key” or “put this value at this key”. No joins, no range scans (mostly), no secondary indexes (or limited ones).

Functional requirements#

put(key, value) — store or overwrite.
get(key) — return the current value or null.
delete(key) — remove the key.
Optional: TTL, conditional puts (CAS), atomic counters, range scans within a partition.
Replication and durability across node failures.

Non-functional requirements#

Latency: p99 read under 5 ms within a region, p99 write under 10 ms. In-memory KV stores (Redis) push p99 below 1 ms.
Availability: 99.99% minimum; Dynamo-style designs target 99.999% with multi-master writes.
Throughput: a sharded cluster handles millions of ops/sec horizontally — DynamoDB is documented to handle 89M requests/sec during Amazon Prime Day.
Durability: replication factor of 3 across availability zones gives 11 nines of durability.
Consistency: tunable — strong for critical paths (transactions), eventual for high-throughput paths (analytics, carts).

High-level design#

                    ┌─── hash(key) → ring position ───┐
client ─> coordinator ─> primary replica ─> N replicas │
                  ▲                                    │
                  └── gossip: who's alive, who owns ───┘

ring layout:        [N1] —— [N2] —— [N3] —— [N4] —— [N1 ...]
                                              ▲
                                     hash(key) lands here →
                                     write to N4, N1, N2 (RF=3)

The coordinator is any node in the cluster — there’s no master. It hashes the key onto a logical ring, identifies the responsible nodes (the next N clockwise), and writes to them. Reads can hit any replica.

Detailed design#

Consistent hashing#

Place each node at multiple random positions on a circular hash space (0 ... 2^64). For a key K, walk clockwise from hash(K) to find the responsible nodes. Adding or removing one node moves only 1/N of keys.

hash space (0 ... 2^64)
      ┌─ N1 ─ N3 ─ N1 ─ N2 ─ N1 ─ N3 ─ N2 ─┐
      │                                    │  (each node owns ~256 vnodes)
      └────────────────────────────────────┘

Virtual nodes (vnodes): assign each physical node ~256 random ring positions instead of 1. This smooths the load even when nodes are heterogeneous and makes rebalancing parallelizable.

Replication#

For a target replication factor RF (typically 3), the coordinator writes to the next RF distinct physical nodes on the ring. Across-AZ placement prevents an AZ outage from losing all replicas.

Quorums (W and R)#

A Dynamo-style write is acknowledged when W of RF replicas confirm. A read returns when R of RF reply. The invariant for strong consistency:

W + R > RF

Common tunings:

W=1, R=1 — fastest, weakest consistency, will read stale data.
W=2, R=2, RF=3 — quorum, strongly consistent, tolerates one node down.
W=3, R=1, RF=3 — write-all/read-one, fast reads but writes block on slow nodes.
W=1, R=3, RF=3 — write-one/read-all, fast writes; reads see everything.

Versioning and vector clocks#

Concurrent writes from two clients to the same key need a conflict-resolution scheme. Three options:

Last-write-wins        — wall-clock timestamp on each value; newest wins. Risks clock skew.
Vector clocks          — track per-replica version vectors; surface conflicts to the client.
CRDTs                  — values whose merge function is mathematically commutative (counters, sets).

Vector-clock conflicts are returned to the application as a list of sibling values; the app merges per its semantics. Amazon’s original cart example: union of cart contents (a deleted item may resurrect — accepted trade-off).

Failure detection: gossip#

Every node periodically picks a random peer and exchanges a short message: “here’s what I think the cluster looks like”. Membership and liveness converge across the cluster in O(log N) rounds. Cheap, no central coordinator, handles partitions gracefully.

Suspected-dead nodes are demoted (other replicas take over) but not erased — gossip favors slow consensus on liveness to avoid flapping.

Anti-entropy and hinted handoff#

When a replica is down, the coordinator stores a hint (“when N3 comes back, give it this write”) on a peer. On recovery, N3 receives its missed writes.

For longer outages, Merkle trees efficiently compare replica state: each node hashes ranges of its keyspace in a tree; replicas exchange root hashes, then drill down only where they differ. The repair cost is logarithmic in the divergence.

Trade-offs#

Strong consistency (W+R > RF) — guaranteed-fresh reads, no surprises in application logic. Latency = slowest of N replicas; one slow disk drags every read.

Eventual consistency (W=1, R=1) — every request hits one fast replica; max throughput, max availability. Application must tolerate occasional stale reads and (with concurrent writes) conflicts.

Other dials:

Single-master vs multi-master — single-master simplifies conflict resolution (just don’t have any) but creates a write bottleneck and a failover gap. Dynamo-style multi-master is always writable but pushes conflicts to the application.
In-memory vs disk-backed — Redis is in-memory with optional persistence; DynamoDB is SSD-backed with an in-memory tier. In-memory is 10-100× faster but bounded by RAM.
Range scans — Cassandra supports them within a partition key; pure hash-partitioned stores like DynamoDB require a secondary index (LSI/GSI) or query rewriting.

Real-world examples#

DynamoDB — Amazon’s hosted KV. Single-digit-ms p99, auto-sharding, supports W=ALL, R=1 or W=1, R=ALL per request. Powers Lyft’s ride state and Disney’s streaming service.
Cassandra — open-source Dynamo descendant. Discord stores ~200B+ messages across 177 nodes partitioned by (channel_id, time_bucket).
Riak — explicit vector-clock conflict surfacing; popular for shopping carts and IoT before fading in favor of newer entrants.
Voldemort — LinkedIn’s in-house Dynamo-style store, retired 2019 in favor of Espresso.
Redis Cluster — sharded by hash slot (16384 slots), gossip-based, MOVED and ASK redirects steer clients to the right shard.
etcd / Consul — strongly consistent KV for service discovery and configuration; uses Raft, not gossip, because they prioritize consistency over availability.

Distributed Cache — same data structure, different durability stance.
Databases — KV is the simplest shape; everything else trades flexibility for cost.
Sequencer — paired with KV when you need ordered keys (e.g. snowflake IDs as primary keys).
Load Balancers — clients use the same consistent-hash trick to find the right replica.