B-Tree Indexes
The classic disk-friendly tree, fanout, leaf-page linking, why every relational DBMS defaults to B-tree.
What it is#
A B-tree is a self-balancing, multi-way search tree where every node holds many keys and has many children — typically hundreds — so that the tree stays shallow even at billions of rows. Every relational database in serious production use (PostgreSQL, MySQL/InnoDB, SQL Server, Oracle, SQLite, DB2) defaults to B-tree for primary and secondary indexes because the structure matches how disks and SSDs actually move data: in fixed-size pages, with a strong preference for sequential access within a page and an acceptance that the next page may be anywhere.
The variant nearly everyone actually ships is the B+ tree: internal nodes hold only routing keys (no row data), all the row data lives at the leaf level, and leaf pages are linked in a doubly-linked list so range scans can walk sideways without going back up the tree. When practitioners say “B-tree”, they almost always mean B+ tree.
When to use it#
B-trees are the default. Reach for one when:
- The predicate is equality or range on indexed columns.
WHERE id = ?,WHERE created_at BETWEEN ? AND ?,WHERE email LIKE 'a%'(prefix match) all use a B-tree. - The data is mutable — rows are inserted, updated, and deleted at OLTP rates. B-trees handle in-place updates well; LSM-trees pay a compaction tax for the same workload.
- You need sorted scans in index order — pagination by
ORDER BY created_at, top-N queries, merge joins. The linked leaf level makes these cheap. - Cardinality is moderate-to-high. With low cardinality (a few distinct values), a bitmap index is denser; for unique-or-nearly-unique columns, B-tree dominates.
Avoid a B-tree when the workload is pure append-only writes at very high rate (LSM wins), when the predicate is full-text search (use an inverted index / GIN), when the data is multi-dimensional with range predicates on multiple dimensions (R-tree, GiST), or when you only ever do exact-match lookups at extreme QPS on a low-mutation table (a hash index can edge it out).
How it works#
Structure#
A B+ tree has three levels of node:
[root page] / | \ [internal] [internal] [internal] / | \ ... ... [leaf][leaf][leaf] ←→ [leaf] ←→ [leaf] ... rows rows rows rows rowsEach node is a page — a fixed-size disk block, typically 8 KB in Postgres, 16 KB in InnoDB. A page holds many keys (the “fanout”), so a tree of depth 4 with fanout 250 already addresses 250^4 ≈ 4 billion keys.
Fanout is the lever. Pack more keys per page and the tree gets shallower; every depth level avoided is one fewer disk seek per lookup. Modern engines pack key prefixes, deduplicate within a page, and use variable-length entries to push the fanout as high as the page size allows.
Lookup#
SELECT * FROM users WHERE id = 42 walks the tree:
- Read the root page; binary-search the keys to find the child page covering key 42.
- Read that internal page; binary-search again.
- Repeat until a leaf page.
- The leaf entry holds the row (for a clustered index) or a pointer to the row in the heap (for a secondary index).
Depth is log_fanout(N). For N = 10^9 and fanout 250, that’s 4 page reads — and the upper levels almost always live in the buffer cache, so usually only the leaf read actually hits disk.
Insert and split#
When a leaf fills up, it splits in half: half the entries stay, half move to a new sibling page, and the parent gets a routing key pointing at the new sibling. If the parent overflows, it splits too — recursively up to the root, which itself splits and grows the tree by one level. This is the only way a B-tree gets deeper.
Splits are page-level writes. The WAL records them so they survive crashes mid-split.
Range scans#
The leaf-level linked list is what makes range scans cheap. WHERE created_at BETWEEN '2026-01-01' AND '2026-02-01':
- Walk the tree once to find the first matching leaf.
- Follow leaf-to-leaf next-pointers, emitting rows until the upper bound is crossed.
No further tree walks. Throughput is bounded by sequential page I/O on the leaves.
Clustered vs secondary#
A clustered index stores the actual row data in the leaf pages, sorted by the index key. InnoDB always builds the primary key as a clustered index; Postgres builds a heap and treats the primary key as a regular B-tree pointing into it. Clustered indexes make primary-key lookups one page read; secondary indexes need an extra hop to the heap.
A secondary index stores (key → row_locator) pairs in the leaves. The row_locator is the primary key (InnoDB) or the heap tuple identifier (Postgres TID). Looking up by secondary index requires walking the secondary index, then walking the clustered index or heap — two trees, two sets of page reads.
Variants#
- B+ tree — leaves linked, internal nodes hold only routing keys. The shipping default.
- B-tree* — variant that defers splits by redistributing to siblings first. Higher fill factor (~2/3 vs 1/2), more complex insert path.
- Covering index (
INCLUDEclause) — a B-tree that stores additional non-key columns at the leaf level so the query can be answered without touching the heap. Postgres 11+ and SQL Server support this directly. Trades disk space for a heap-fetch elimination. - Multi-column / composite index — keys are tuples
(col_a, col_b, col_c)sorted lexicographically. Answers queries on leading prefixes (col_a,col_a + col_b,col_a + col_b + col_c), not on trailing columns alone. - Partial index — indexes only rows matching a predicate (
WHERE deleted_at IS NULL). Smaller, faster to maintain, useful when one value dominates. - Expression index — indexes the result of an expression (
LOWER(email)). Lets indexed lookups use derived values without storing them as columns.
Trade-offs#
Inside the B-tree decision space:
- Adding indexes — every secondary index slows every write to the table by one full B-tree traversal plus a leaf-page write. A table with eight indexes typically has 3–5x the write latency of the same table with one.
- Fill factor — the percentage of each leaf page used at index build time. 100% is densest but every subsequent insert into that range causes a split. 70-80% leaves room for inserts to land without splitting. Postgres defaults to 90%; InnoDB to ~15/16 of the page.
- Index size vs query plan — a huge index that’s mostly cold still costs RAM in the buffer pool. The optimiser may also choose to bitmap-scan a low-cardinality index instead of using it directly, defeating the point.
Common pitfalls#
- Leading-wildcard
LIKE—WHERE email LIKE '%@gmail.com'cannot use a B-tree onemail. The index is sorted left-to-right; without a prefix, every leaf has to be scanned. Fix: store a reversed copy and index that, or use a trigram index (pg_trgm). - Function on the indexed column —
WHERE LOWER(email) = ?won’t use a plain index onemail. Either index the expression (CREATE INDEX ON users (LOWER(email))) or store the lowercased value as a column. - Wrong column order in composite index — an index on
(status, user_id)answersWHERE status = ? AND user_id = ?andWHERE status = ?, but notWHERE user_id = ?alone. Put the most selective equality column first when the access pattern is equality-then-range. - Indexing low-cardinality columns — a B-tree on a boolean column is mostly useless; the planner often picks a sequential scan anyway because most pages match. For low-cardinality columns, a partial index on the rare value (
WHERE is_deleted = true) is usually the right move. - Forgetting write amplification — adding an index to a hot OLTP table can double write latency. Always measure before committing the index to production.
- Bloat in the heap, not the index (Postgres) — under heavy updates, the heap accumulates dead tuples that the index still points at;
VACUUMcleans them up but the index can drift larger than the live data.REINDEX CONCURRENTLYis the operational fix.
Why B-trees, specifically, and not balanced binary trees
A red-black tree or AVL tree has fanout 2 — every node has at most two children. For a billion rows the depth is log_2(10^9) ≈ 30. With fanout 250 (a B-tree) the depth is 4. Disks and SSDs move data in pages of 4-16 KB, not in individual key-sized chunks. Reading a page costs roughly the same as reading any byte of that page. A B-tree fills each page with hundreds of keys; a binary tree wastes most of every page. The depth difference is one disk seek per level avoided — at typical SSD latencies of 100 microseconds, that’s the difference between a sub-millisecond and a multi-millisecond lookup at the leaves.
Related building blocks#