Hash and Bitmap Indexes
Hash indexes for equality lookups, bitmap indexes for low-cardinality columns, the trade-offs vs B-tree.
What it is#
Hash and bitmap indexes are the two main alternatives to a B-tree when the workload doesn’t match the B-tree’s sweet spot. A hash index maps each key to a bucket via a hash function; lookup is O(1) for equality. A bitmap index stores one bit per row per distinct value of a column; the index for a column with three values across a million rows is three bitmaps of a million bits each, and WHERE color IN ('red', 'blue') is a bitwise OR of two bitmaps.
Both exist for the same reason: B-trees pay for properties (sorted order, balanced depth, range scans) that some workloads don’t need. When the only access pattern is equality and the cardinality is extreme — either very high (hash) or very low (bitmap) — a specialised index can use less space and answer queries faster than a B-tree on the same column.
When to use it#
Reach for a hash index when:
- The query is pure equality —
WHERE user_id = ?— never a range orORDER BY. - Cardinality is high (most values distinct) so the hash distributes well.
- The data set fits in memory, or the engine ships a disk-aware variant (Postgres hash indexes, Oracle hash clusters, MySQL Memory engine).
- Constant-time worst case matters more than sorted-scan throughput — typical in OLTP point lookups, lock managers, in-memory caches.
Reach for a bitmap index when:
- Cardinality is low — gender, country, status enum, boolean flags (say, 2-1000 distinct values).
- Queries combine multiple low-cardinality predicates —
WHERE status = 'active' AND country = 'US' AND tier IN ('gold', 'platinum'). Bitmap AND/OR is essentially free. - The workload is OLAP / DWH — bulk-load, query, occasionally re-load. Bitmap indexes are notoriously expensive to maintain under high write rates.
- The table is large (tens of millions of rows up) — bitmap savings come from packing many bits per byte; small tables get no benefit.
How it works#
Hash index#
A hash index is a hash table on disk. The structure has three pieces:
- Hash function — maps a key to a bucket number. Postgres uses a Murmur-style hash; Oracle uses its own.
- Bucket directory — array of pointers, one per bucket. Often power-of-two sized so growth is “double the directory”.
- Buckets — each bucket is a page (or chain of pages) holding
(key, row_locator)pairs that hashed there.
Lookup for WHERE id = 42:
- Compute
hash(42)to get a bucket number. - Read that bucket’s page.
- Scan the page for the entry matching
42. Done.
That’s one page read in the best case, two if the bucket has overflowed. No tree walk, no comparison cascade.
Collisions are handled by chaining (overflow pages) or by linear/extendible hashing schemes that re-partition the directory as buckets fill. Postgres uses linear hashing; the directory grows one bucket at a time as the load factor climbs.
No range scans. WHERE id > 100 AND id < 200 cannot use a hash index — the hash function destroys ordering. The planner falls back to a sequential scan or another index.
No ordered output. ORDER BY on a hash-indexed column still requires a sort.
Bitmap index#
For a column with D distinct values across N rows, a bitmap index is D bitmaps each of N bits. The bit at position i in bitmap v is 1 if row i has value v, else 0.
For status with values active, pending, churned, banned:
row: 1 2 3 4 5 6 7 8status: a a p c a b a p
bitmap_a: 1 1 0 0 1 0 1 0bitmap_p: 0 0 1 0 0 0 0 1bitmap_c: 0 0 0 1 0 0 0 0bitmap_b: 0 0 0 0 0 1 0 0WHERE status = 'active' AND country = 'US' is a bitwise AND of bitmap_a (from the status index) and bitmap_US (from the country index). Producing the result bitmap is a single CPU-vectorised pass at memory bandwidth speed.
Compression matters. A naïve bitmap is N bits per distinct value — for 1B rows and 100 values, that’s 12.5 GB. In practice bitmaps are compressed with run-length schemes like WAH, EWAH, or Roaring; the compressed bitmaps for low-cardinality columns shrink 10-100x with bitwise ops still cheap on the compressed form.
Updates are expensive. Inserting a row updates one bit in one bitmap (fine). Updating a row’s value flips one bit in the old bitmap and one in the new (locks both bitmaps). Deleting rows requires a tombstone bitmap or a rebuild. At OLTP write rates this is painful — bitmap indexes are mostly used in OLAP / DWH where loads are batch.
How the planner uses both#
A common Postgres plan is bitmap heap scan on a B-tree index: the planner walks the B-tree to produce a list of matching TIDs, builds an in-memory bitmap of them, ORs/ANDs with bitmaps from other index walks, then fetches the result rows from the heap in TID order. This isn’t a stored bitmap index — it’s a runtime bitmap built from a B-tree — but the same combinatorial benefit applies: multiple B-trees can be combined at query time.
Variants#
- Linear hashing — grow one bucket at a time, splitting as the load factor exceeds a threshold. Postgres uses this.
- Extendible hashing — variable-depth directory; lookup uses the top
kbits of the hash for the current depth. Doubles the directory when any bucket overflows. Used in some embedded engines. - Cuckoo hashing — two hash functions, two candidate buckets, displace-on-insert. Worst-case
O(1)lookups; popular in in-memory hash tables and some NewSQL engines. - Hash clusters (Oracle) — physically store rows in the buckets of a hash on the cluster key; lookup is one page read. Faster than a hash index because there’s no second hop to the heap.
- WAH / EWAH bitmap compression — word-aligned compressed bitmaps that still support cheap bitwise ops. Standard in Oracle bitmap indexes.
- Roaring bitmaps — chunked bitmap representation that picks between array, bitmap, and run encodings per 16-bit chunk. Used in Druid, Pinot, Lucene, and increasingly in OLAP engines.
- Bit-sliced indexes — represent each value with
log2(D)bitmaps (one per bit position). Reduces space whenDis large; supports range predicates as bitmap arithmetic.
Trade-offs#
O(1) equality lookup, smaller than B-tree for the same row count, no balancing logic to maintain. Cannot do range scans, cannot serve ORDER BY, sensitive to hash quality, larger memory footprint per bucket if load factor is low. O(log N) equality but constant factors are small; range scans, ordered scans, LIKE 'prefix%' all supported; the default for almost every workload. Slightly larger on disk; deeper trees mean more page reads at the bottom. Common pitfalls#
- Adding a hash index to a column you also range-query — the hash index will sit unused for the range queries; you pay write amplification for both indexes for no benefit. Pick the access pattern first.
- Bitmap indexes on a high-update table — every UPDATE locks the bitmap for the old and new values; at OLTP write rates this serialises. Some engines (Oracle) explicitly recommend against bitmap indexes on tables with frequent DML.
- Bitmap on a high-cardinality column — millions of bitmaps means millions of tiny files / blocks. Cardinality < 1000 is the comfortable zone; over that, bit-sliced indexes or B-tree are better.
- Confusing bitmap index with bitmap scan — Postgres has bitmap scans (runtime AND/OR over B-tree results) but historically does not have stored bitmap indexes. Extensions like
pg_bitmapexist; without one, you’re using B-tree under the hood. - Hash index durability gaps (older Postgres) — pre-10.0, Postgres hash indexes were not WAL-logged and didn’t survive replication or crash. Modern versions are crash-safe, but old advice on the internet still warns against them.
- Forgetting
ORDER BY— a query that wants sorted output (pagination, top-N) cannot benefit from a hash index. The planner adds a sort and the index becomes pure overhead.
Why hash indexes are rare in practice despite being faster on paper
In a benchmark on a single column with pure equality lookups, hash beats B-tree on lookup latency. In practice three things stop hash indexes from taking over. First, real workloads almost always need some range or ordered access on the same column — ORDER BY created_at DESC LIMIT 20 shows up everywhere. Second, B-tree constants are small: a 4-deep B-tree with the top 3 levels cached is also one page read at the leaf. Third, B-tree handles concurrent inserts via well-understood lock coupling; hash indexes need bucket-level locking that’s just as complex with no big win. The result is that hash indexes survive in niche slots — primary-key columns on truly read-mostly tables, in-memory engines, lock manager hash tables — while B-tree carries the rest.
Related building blocks#