LSM-Tree Storage — DBMS · Engineering Playbook

What it is#

An LSM-tree (log-structured merge-tree) is a write-optimised on-disk index structure that buffers writes in memory and flushes them to immutable, sorted files on disk, periodically merging older files into larger ones. The structure was formalised by O’Neil, Cheng, Gawlick, and O’Neil in 1996, but it became the dominant storage format only after Google’s Bigtable and LevelDB papers re-popularised it in the late 2000s. Today it underpins RocksDB, LevelDB, Cassandra, ScyllaDB, HBase, DynamoDB’s storage nodes, CockroachDB, TiKV, InfluxDB, and most modern embedded KV stores.

The core trade is write throughput for read complexity. A B-tree writes each update in place — random I/O across pages. An LSM-tree batches writes into a memory buffer, then writes them out sequentially in sorted runs. Sequential writes are 10-100x faster than random writes on every storage medium ever shipped. The cost: reads may have to consult multiple sorted runs to find the latest value for a key, and a background compaction process must continually merge runs to keep read amplification bounded.

When to use it#

Pick LSM storage when:

The workload is write-heavy — telemetry, event ingestion, time-series, write-heavy KV, log analytics. LSM ingest rates of 100K-1M writes/sec per node are achievable on commodity hardware.
Writes are predominantly inserts or blind updates (no read-before-write). LSMs shine when the write path doesn’t need to consult the current value.
The data is sortable by key and queries are dominated by point lookups or short range scans within a key prefix. Cassandra-style partition + clustering keys, RocksDB column families, DynamoDB partition + sort keys all fit.
Compression is in scope. SSTables compress beautifully (10x+ for typical KV data) because each run is sorted and most blocks are homogeneous.

Avoid LSM when:

The workload is read-mostly with random point lookups and write rates are modest — a B-tree’s in-place updates and shallower lookup path win.
Read latency p99 must be tight without operational tuning. Worst-case reads on an LSM touch multiple runs and can stall during compaction; B-trees have flatter latency curves out of the box.
Transactional semantics span many keys with read-your-writes guarantees and complex isolation. LSMs can do this (RocksDB transactions, CockroachDB) but the engineering effort is significant.

How it works#

The three layers#

                 writes
                   │
                   ▼
        ┌────── memtable ──────┐     in memory, ~64 MB
        │  sorted (skip list)  │     also a WAL on disk
        └──────────┬───────────┘
                   │ flush when full
                   ▼
        ┌─── Level 0 SSTables ──┐    immutable, sorted by key
        │  may overlap in range │    within each file
        └──────────┬────────────┘
                   │ compact
                   ▼
        ┌─── Level 1 SSTables ──┐    each level 10x bigger
        │  non-overlapping     │     than the previous
        └──────────┬────────────┘
                   │ compact
                   ▼
              Level 2, 3, ...

The memtable is an in-memory sorted structure (typically a skip list or a B-tree); every write goes here first. To survive crashes, every write also appends to a write-ahead log on disk before the memtable accepts it. When the memtable hits its size threshold (64-128 MB is typical), it’s frozen and a new memtable takes over.

The frozen memtable is flushed to disk as an SSTable (Sorted String Table) — an immutable file containing all the memtable’s KV pairs sorted by key, plus a sparse index and a Bloom filter. Once written, SSTables are never modified; updates and deletes become new entries in later SSTables.

Reads#

Looking up a key:

Check the active memtable.
Check any frozen memtables not yet flushed.
Check each SSTable from newest to oldest. For each, the Bloom filter says “definitely not here” or “maybe here”; on “maybe”, read the index block, then the data block.
First match wins (newest version).

A deletion is a tombstone — a key with a deletion marker. Reads encountering a tombstone return “not found” even if older SSTables still contain the key. Compaction eventually drops the tombstone and the older versions together.

The number of SSTables consulted on a read is read amplification. Bloom filters and the level structure keep it bounded; without compaction it would grow with every flush.

Compaction#

Compaction merges multiple SSTables into fewer, larger SSTables, dropping superseded versions and tombstones. Two main strategies:

Leveled compaction (LevelDB, RocksDB default). Each level is ~10x bigger than the previous. When a level overflows, one of its SSTables is merged with the overlapping files in the next level. Read amplification is bounded (one file per level), write amplification is high (each byte rewritten ~10 times on its way from L0 to the deepest level).
Size-tiered compaction (Cassandra default). When N (typically 4) SSTables of similar size accumulate, they’re merged into one larger SSTable. Lower write amplification (~2-4x), higher space amplification and read amplification because more overlapping files may exist.
Tiered + leveled hybrid (RocksDB universal, ScyllaDB ICS). Tiered at the lower levels, leveled at the upper levels. Picks compromises per level.

Compaction runs continuously in background threads. The single biggest operational variable in an LSM is how aggressively to compact — too slow and reads degrade and space balloons; too fast and CPU and write bandwidth get consumed by compaction instead of foreground writes.

Bloom filters#

Each SSTable carries a Bloom filter over its keys. On a point lookup, the engine checks the filter first; a “no” definitively skips the SSTable’s data blocks. False-positive rates of 1% are typical with ~10 bits per key; pushing to 0.1% costs ~16 bits per key and pays off when reads dominate.

The three amplifications#

Write amplification — total bytes written to disk per byte of user data. Leveled compaction is 10-30x; size-tiered is 2-4x.
Read amplification — files consulted per read. Leveled: ~1 per level (typically 4-5). Size-tiered: bounded by the number of tiers.
Space amplification — disk used per byte of live data. Live data plus tombstones plus pre-compaction versions. Leveled keeps it under ~1.1x; size-tiered can hit 2x or more.

The LSM tuning problem is picking a strategy and parameters that minimise the amplification that matters for your workload while keeping the others acceptable.

Variants#

LevelDB / RocksDB — the canonical embedded LSM library; powers MySQL MyRocks, CockroachDB (until Pebble), TiKV, ArangoDB, Ceph BlueStore, and dozens more.
Pebble — Cockroach Labs’s Go-native rewrite of RocksDB. Same model, no CGo.
Cassandra / ScyllaDB — distributed wide-column stores with LSM per node; ScyllaDB rewrote Cassandra in C++ with a thread-per-core architecture.
HBase — Bigtable-style on HDFS; LSM with a Hadoop-shaped operational story.
InfluxDB TSI / IOx — LSM tuned for time-series with timestamp-prefixed keys.
WiscKey — separate key index from value log; only keys live in the LSM, values are written once to a value log. Reduces write amplification dramatically at the cost of more random reads.
Tiered storage LSMs — keep recent levels on SSD, older levels on object storage (S3). RocksDB tiered, ClickHouse Keeper, Pinot, Druid.

Trade-offs#

LSM-tree — sequential writes, high ingest throughput, excellent compression, scales to PB on commodity disks. Read amplification (multiple files per lookup), background compaction competes with foreground work, tail-latency spikes during big compactions, space amplification varies with strategy.

B-tree — in-place updates, low and predictable read latency, well-understood concurrency. Random I/O on writes, write amplification from page splits, harder to compress (sparser pages), lower peak write throughput.

Operational dimensions:

CPU vs disk balance. LSMs lean on CPU for compaction; B-trees lean on disk for random I/O. On modern NVMe with many cores, LSMs scale ingest by adding compaction threads.
Tail latency. Stalls during L0 → L1 compaction (or “write stalls” in RocksDB) are a recurring operational concern. Tune level0_slowdown_writes_trigger and friends or accept the spikes.
Range scans. A range scan must merge sorted iterators from every relevant SSTable plus the memtable. Cheaper than it sounds but more work than a B-tree leaf walk.
Multi-key transactions. RocksDB supports them with optimistic and pessimistic variants; the engineering is real and the overhead non-trivial.

Common pitfalls#

Forgetting compaction is a foreground tax. Compaction reads from disk, writes to disk, and burns CPU. If you size the host assuming only ingest and queries, compaction will steal headroom under load. Plan for compaction to use 30-50% of total disk bandwidth.
Tuning the LSM blindly. RocksDB has 50+ tunables; touching them without a workload model usually makes things worse. Start from the defaults, measure write/read/space amplification, change one knob at a time.
Letting L0 grow unbounded. If flush rate exceeds compaction rate, L0 SSTables accumulate, reads slow, and eventually the engine triggers write stalls. Fix: more compaction threads, smaller memtables, faster disks — not bigger memtables.
Bloom filter false positives on range scans. Bloom filters help point lookups; range scans must consult every SSTable that overlaps the range regardless of the filter. Workloads dominated by ranges don’t get the Bloom benefit.
Big values + leveled compaction. Every byte of a large value is rewritten 10-20x as it migrates down the levels. WiscKey-style key-value separation or moving large blobs to a separate store solves this.
Backups during compaction. Snapshotting an LSM during heavy compaction can capture an unusually large set of pre-compaction SSTables. Coordinate with compaction or use the engine’s checkpoint API.

Why LSMs won the post-2010 storage decade

Three shifts converged. SSDs made sequential write throughput a real differentiator: B-trees used random writes anyway, so SSD endurance and bandwidth went underused. The cloud’s pricing model — storage by the GB, IOPS by the operation — rewarded compression and write batching. And the 2010s data shape — telemetry, events, append-mostly history — was a perfect LSM workload. Bigtable proved the model at scale, LevelDB made it embeddable, RocksDB made it tunable and durable, and a decade of engines (Cassandra, DynamoDB, CockroachDB, Pebble) shipped on top. B-trees never lost OLTP, but LSMs took everything else.