Key-Value Store
Consistent-hash ring, replication factor, versioning (vector clocks), failure detection. Dynamo-style design.
Use cases#
A key-value store maps an opaque key to an opaque value. Three flavors of use case:
- Session and feature-flag storage — user clicks → look up “what experiments is user 42 in?” — sub-millisecond, billions of reads/day.
- Shopping cart / per-user state — Amazon’s original Dynamo paper was motivated by this. Always-on writes; merging concurrent updates is acceptable.
- Application primary store — when access is fundamentally single-key (URL shortener, profile lookup), a KV store is cheaper, faster, and easier to scale than a relational database.
The defining trait: queries are always “get the value for this key” or “put this value at this key”. No joins, no range scans (mostly), no secondary indexes (or limited ones).
Functional requirements#
put(key, value)— store or overwrite.get(key)— return the current value ornull.delete(key)— remove the key.- Optional: TTL, conditional puts (CAS), atomic counters, range scans within a partition.
- Replication and durability across node failures.
Non-functional requirements#
- Latency: p99 read under 5 ms within a region, p99 write under 10 ms. In-memory KV stores (Redis) push p99 below 1 ms.
- Availability: 99.99% minimum; Dynamo-style designs target 99.999% with multi-master writes.
- Throughput: a sharded cluster handles millions of ops/sec horizontally — DynamoDB is documented to handle 89M requests/sec during Amazon Prime Day.
- Durability: replication factor of 3 across availability zones gives 11 nines of durability.
- Consistency: tunable — strong for critical paths (transactions), eventual for high-throughput paths (analytics, carts).
High-level design#
┌─── hash(key) → ring position ───┐client ─> coordinator ─> primary replica ─> N replicas │ ▲ │ └── gossip: who's alive, who owns ───┘
ring layout: [N1] —— [N2] —— [N3] —— [N4] —— [N1 ...] ▲ hash(key) lands here → write to N4, N1, N2 (RF=3)The coordinator is any node in the cluster — there’s no master. It hashes the key onto a logical ring, identifies the responsible nodes (the next N clockwise), and writes to them. Reads can hit any replica.
Detailed design#
Consistent hashing#
Place each node at multiple random positions on a circular hash space (0 ... 2^64). For a key K, walk clockwise from hash(K) to find the responsible nodes. Adding or removing one node moves only 1/N of keys.
hash space (0 ... 2^64) ┌─ N1 ─ N3 ─ N1 ─ N2 ─ N1 ─ N3 ─ N2 ─┐ │ │ (each node owns ~256 vnodes) └────────────────────────────────────┘Virtual nodes (vnodes): assign each physical node ~256 random ring positions instead of 1. This smooths the load even when nodes are heterogeneous and makes rebalancing parallelizable.
Replication#
For a target replication factor RF (typically 3), the coordinator writes to the next RF distinct physical nodes on the ring. Across-AZ placement prevents an AZ outage from losing all replicas.
Quorums (W and R)#
A Dynamo-style write is acknowledged when W of RF replicas confirm. A read returns when R of RF reply. The invariant for strong consistency:
W + R > RFCommon tunings:
W=1, R=1— fastest, weakest consistency, will read stale data.W=2, R=2, RF=3— quorum, strongly consistent, tolerates one node down.W=3, R=1, RF=3— write-all/read-one, fast reads but writes block on slow nodes.W=1, R=3, RF=3— write-one/read-all, fast writes; reads see everything.
Versioning and vector clocks#
Concurrent writes from two clients to the same key need a conflict-resolution scheme. Three options:
Last-write-wins — wall-clock timestamp on each value; newest wins. Risks clock skew.Vector clocks — track per-replica version vectors; surface conflicts to the client.CRDTs — values whose merge function is mathematically commutative (counters, sets).Vector-clock conflicts are returned to the application as a list of sibling values; the app merges per its semantics. Amazon’s original cart example: union of cart contents (a deleted item may resurrect — accepted trade-off).
Failure detection: gossip#
Every node periodically picks a random peer and exchanges a short message: “here’s what I think the cluster looks like”. Membership and liveness converge across the cluster in O(log N) rounds. Cheap, no central coordinator, handles partitions gracefully.
Suspected-dead nodes are demoted (other replicas take over) but not erased — gossip favors slow consensus on liveness to avoid flapping.
Anti-entropy and hinted handoff#
When a replica is down, the coordinator stores a hint (“when N3 comes back, give it this write”) on a peer. On recovery, N3 receives its missed writes.
For longer outages, Merkle trees efficiently compare replica state: each node hashes ranges of its keyspace in a tree; replicas exchange root hashes, then drill down only where they differ. The repair cost is logarithmic in the divergence.
Trade-offs#
W+R > RF) — guaranteed-fresh reads, no surprises in application logic. Latency = slowest of N replicas; one slow disk drags every read. W=1, R=1) — every request hits one fast replica; max throughput, max availability. Application must tolerate occasional stale reads and (with concurrent writes) conflicts. Other dials:
- Single-master vs multi-master — single-master simplifies conflict resolution (just don’t have any) but creates a write bottleneck and a failover gap. Dynamo-style multi-master is always writable but pushes conflicts to the application.
- In-memory vs disk-backed — Redis is in-memory with optional persistence; DynamoDB is SSD-backed with an in-memory tier. In-memory is 10-100× faster but bounded by RAM.
- Range scans — Cassandra supports them within a partition key; pure hash-partitioned stores like DynamoDB require a secondary index (LSI/GSI) or query rewriting.
Real-world examples#
- DynamoDB — Amazon’s hosted KV. Single-digit-ms p99, auto-sharding, supports
W=ALL, R=1orW=1, R=ALLper request. Powers Lyft’s ride state and Disney’s streaming service. - Cassandra — open-source Dynamo descendant. Discord stores ~200B+ messages across 177 nodes partitioned by
(channel_id, time_bucket). - Riak — explicit vector-clock conflict surfacing; popular for shopping carts and IoT before fading in favor of newer entrants.
- Voldemort — LinkedIn’s in-house Dynamo-style store, retired 2019 in favor of Espresso.
- Redis Cluster — sharded by hash slot (
16384slots), gossip-based,MOVEDandASKredirects steer clients to the right shard. - etcd / Consul — strongly consistent KV for service discovery and configuration; uses Raft, not gossip, because they prioritize consistency over availability.
Related building blocks#
- Distributed Cache — same data structure, different durability stance.
- Databases — KV is the simplest shape; everything else trades flexibility for cost.
- Sequencer — paired with KV when you need ordered keys (e.g. snowflake IDs as primary keys).
- Load Balancers — clients use the same consistent-hash trick to find the right replica.