B-Tree Indexes — DBMS · Engineering Playbook

What it is#

A B-tree is a self-balancing, multi-way search tree where every node holds many keys and has many children — typically hundreds — so that the tree stays shallow even at billions of rows. Every relational database in serious production use (PostgreSQL, MySQL/InnoDB, SQL Server, Oracle, SQLite, DB2) defaults to B-tree for primary and secondary indexes because the structure matches how disks and SSDs actually move data: in fixed-size pages, with a strong preference for sequential access within a page and an acceptance that the next page may be anywhere.

The variant nearly everyone actually ships is the B+ tree: internal nodes hold only routing keys (no row data), all the row data lives at the leaf level, and leaf pages are linked in a doubly-linked list so range scans can walk sideways without going back up the tree. When practitioners say “B-tree”, they almost always mean B+ tree.

When to use it#

B-trees are the default. Reach for one when:

The predicate is equality or range on indexed columns. WHERE id = ?, WHERE created_at BETWEEN ? AND ?, WHERE email LIKE 'a%' (prefix match) all use a B-tree.
The data is mutable — rows are inserted, updated, and deleted at OLTP rates. B-trees handle in-place updates well; LSM-trees pay a compaction tax for the same workload.
You need sorted scans in index order — pagination by ORDER BY created_at, top-N queries, merge joins. The linked leaf level makes these cheap.
Cardinality is moderate-to-high. With low cardinality (a few distinct values), a bitmap index is denser; for unique-or-nearly-unique columns, B-tree dominates.

Avoid a B-tree when the workload is pure append-only writes at very high rate (LSM wins), when the predicate is full-text search (use an inverted index / GIN), when the data is multi-dimensional with range predicates on multiple dimensions (R-tree, GiST), or when you only ever do exact-match lookups at extreme QPS on a low-mutation table (a hash index can edge it out).

How it works#

Structure#

A B+ tree has three levels of node:

                  [root page]
                 /     |     \
          [internal] [internal] [internal]
            /  |  \    ...        ...
         [leaf][leaf][leaf] ←→ [leaf] ←→ [leaf] ...
          rows  rows  rows         rows     rows

Each node is a page — a fixed-size disk block, typically 8 KB in Postgres, 16 KB in InnoDB. A page holds many keys (the “fanout”), so a tree of depth 4 with fanout 250 already addresses 250^4 ≈ 4 billion keys.

Fanout is the lever. Pack more keys per page and the tree gets shallower; every depth level avoided is one fewer disk seek per lookup. Modern engines pack key prefixes, deduplicate within a page, and use variable-length entries to push the fanout as high as the page size allows.

Lookup#

SELECT * FROM users WHERE id = 42 walks the tree:

Read the root page; binary-search the keys to find the child page covering key 42.
Read that internal page; binary-search again.
Repeat until a leaf page.
The leaf entry holds the row (for a clustered index) or a pointer to the row in the heap (for a secondary index).

Depth is log_fanout(N). For N = 10^9 and fanout 250, that’s 4 page reads — and the upper levels almost always live in the buffer cache, so usually only the leaf read actually hits disk.

Insert and split#

When a leaf fills up, it splits in half: half the entries stay, half move to a new sibling page, and the parent gets a routing key pointing at the new sibling. If the parent overflows, it splits too — recursively up to the root, which itself splits and grows the tree by one level. This is the only way a B-tree gets deeper.

Splits are page-level writes. The WAL records them so they survive crashes mid-split.

Range scans#

The leaf-level linked list is what makes range scans cheap. WHERE created_at BETWEEN '2026-01-01' AND '2026-02-01':

Walk the tree once to find the first matching leaf.
Follow leaf-to-leaf next-pointers, emitting rows until the upper bound is crossed.

No further tree walks. Throughput is bounded by sequential page I/O on the leaves.

Clustered vs secondary#

A clustered index stores the actual row data in the leaf pages, sorted by the index key. InnoDB always builds the primary key as a clustered index; Postgres builds a heap and treats the primary key as a regular B-tree pointing into it. Clustered indexes make primary-key lookups one page read; secondary indexes need an extra hop to the heap.

A secondary index stores (key → row_locator) pairs in the leaves. The row_locator is the primary key (InnoDB) or the heap tuple identifier (Postgres TID). Looking up by secondary index requires walking the secondary index, then walking the clustered index or heap — two trees, two sets of page reads.

Variants#

B+ tree — leaves linked, internal nodes hold only routing keys. The shipping default.
B-tree* — variant that defers splits by redistributing to siblings first. Higher fill factor (~2/3 vs 1/2), more complex insert path.
Covering index (INCLUDE clause) — a B-tree that stores additional non-key columns at the leaf level so the query can be answered without touching the heap. Postgres 11+ and SQL Server support this directly. Trades disk space for a heap-fetch elimination.
Multi-column / composite index — keys are tuples (col_a, col_b, col_c) sorted lexicographically. Answers queries on leading prefixes (col_a, col_a + col_b, col_a + col_b + col_c), not on trailing columns alone.
Partial index — indexes only rows matching a predicate (WHERE deleted_at IS NULL). Smaller, faster to maintain, useful when one value dominates.
Expression index — indexes the result of an expression (LOWER(email)). Lets indexed lookups use derived values without storing them as columns.

Trade-offs#

B-tree — balanced read/write performance, supports range scans, in-place updates, mature in every production DBMS. Random I/O on writes (each insert may dirty a different page). Write amplification grows with the number of secondary indexes on the table.

LSM-tree — sequential I/O on writes (all goes to a memtable then flushed in order), high write throughput, but reads must check multiple sorted runs and compaction costs CPU and write bandwidth. Range scans need a merge across runs. Better fit for write-heavy or append-only workloads.

Inside the B-tree decision space:

Adding indexes — every secondary index slows every write to the table by one full B-tree traversal plus a leaf-page write. A table with eight indexes typically has 3–5x the write latency of the same table with one.
Fill factor — the percentage of each leaf page used at index build time. 100% is densest but every subsequent insert into that range causes a split. 70-80% leaves room for inserts to land without splitting. Postgres defaults to 90%; InnoDB to ~15/16 of the page.
Index size vs query plan — a huge index that’s mostly cold still costs RAM in the buffer pool. The optimiser may also choose to bitmap-scan a low-cardinality index instead of using it directly, defeating the point.

Common pitfalls#

Leading-wildcard LIKE — WHERE email LIKE '%@gmail.com' cannot use a B-tree on email. The index is sorted left-to-right; without a prefix, every leaf has to be scanned. Fix: store a reversed copy and index that, or use a trigram index (pg_trgm).
Function on the indexed column — WHERE LOWER(email) = ? won’t use a plain index on email. Either index the expression (CREATE INDEX ON users (LOWER(email))) or store the lowercased value as a column.
Wrong column order in composite index — an index on (status, user_id) answers WHERE status = ? AND user_id = ? and WHERE status = ?, but not WHERE user_id = ? alone. Put the most selective equality column first when the access pattern is equality-then-range.
Indexing low-cardinality columns — a B-tree on a boolean column is mostly useless; the planner often picks a sequential scan anyway because most pages match. For low-cardinality columns, a partial index on the rare value (WHERE is_deleted = true) is usually the right move.
Forgetting write amplification — adding an index to a hot OLTP table can double write latency. Always measure before committing the index to production.
Bloat in the heap, not the index (Postgres) — under heavy updates, the heap accumulates dead tuples that the index still points at; VACUUM cleans them up but the index can drift larger than the live data. REINDEX CONCURRENTLY is the operational fix.

Why B-trees, specifically, and not balanced binary trees

A red-black tree or AVL tree has fanout 2 — every node has at most two children. For a billion rows the depth is log_2(10^9) ≈ 30. With fanout 250 (a B-tree) the depth is 4. Disks and SSDs move data in pages of 4-16 KB, not in individual key-sized chunks. Reading a page costs roughly the same as reading any byte of that page. A B-tree fills each page with hundreds of keys; a binary tree wastes most of every page. The depth difference is one disk seek per level avoided — at typical SSD latencies of 100 microseconds, that’s the difference between a sub-millisecond and a multi-millisecond lookup at the leaves.