Row-Store vs Column-Store — DBMS · Engineering Playbook

What it is#

A storage layout is the rule by which a database lays out a table’s bytes on disk. Row-stores keep each row’s columns contiguous — the bytes for one tuple sit next to each other, then the next tuple, then the next. Column-stores keep each column’s values contiguous — every value for created_at sits next to every other created_at, then every user_id, then every amount. Same logical table, fundamentally different physical layout, dramatically different performance envelopes.

OLTP engines (PostgreSQL, MySQL/InnoDB, Oracle, SQL Server) ship row-stores by default because their workload reads and writes whole rows. OLAP engines (ClickHouse, Snowflake, BigQuery, Redshift, DuckDB, Vertica, Parquet on a lake) ship column-stores because their workload scans a few columns out of many across billions of rows. The layout choice is the single biggest reason an analytical query that takes seconds on Snowflake takes hours on Postgres on the same data.

When to use it#

Pick a row-store when:

The workload is point lookups and small range scans by primary key — SELECT * FROM orders WHERE id = ? returns whole rows in one page read.
Mutations are row-shaped — INSERT, UPDATE, DELETE touch all columns of a tuple together.
Working-set hit rate on the buffer pool matters more than scan throughput.
Latency budgets are sub-10 ms p99 and concurrency is in the hundreds-to-thousands of transactions per second.

Pick a column-store when:

Queries are wide scans over a few columns — SELECT SUM(amount) FROM orders WHERE created_at > ? touches two columns out of twenty.
Tables are large (10s of GB to PB) and mostly append-only — analytical fact tables, event logs, telemetry.
Compression matters — homogeneous columns compress 5-20x; reducing bytes read is reducing latency.
Vectorised execution is in scope — operating on a column chunk at a time is what makes modern OLAP fast.

How it works#

The same table, two layouts#

For a table orders(id, user_id, created_at, amount, status):

Row-store on disk:
  [id=1 user=42 ts=t1 amt=10 status=ok][id=2 user=43 ts=t2 amt=20 ...] ...

Column-store on disk:
  ids:        [1, 2, 3, 4, 5, ...]
  user_ids:   [42, 43, 42, 44, 42, ...]
  created_at: [t1, t2, t3, t4, t5, ...]
  amount:     [10, 20, 30, 40, 50, ...]
  status:     [ok, ok, fail, ok, ok, ...]

A point lookup by id is one page read in the row-store; in the column-store it’s five reads (one per column) plus a way to align them by position. A SUM(amount) across a billion rows is a billion column-value reads in the row-store (each row pulls all columns into the buffer pool) or one contiguous column scan in the column-store.

Compression as a force multiplier#

Column values are homogeneous — same type, often same scale, often heavily repeating. That makes them compress beautifully:

Run-length encoding (RLE) — [ok, ok, ok, fail, ok, ok] becomes [(ok, 3), (fail, 1), (ok, 2)]. Best for sorted or low-cardinality columns.
Dictionary encoding — replace each string with a small integer ID into a per-block dictionary. Status column with 4 values becomes 2 bits per row instead of 8 bytes.
Delta encoding — store the difference from the previous value. Great for sorted timestamps and monotonic IDs.
Bit-packing — if a column has values 0-7, store each in 3 bits instead of 32 or 64.
General-purpose (LZ4, Zstd) on top of all of the above.

5x-20x compression is normal; 50x+ happens for low-cardinality categorical columns. Compression isn’t just storage savings — it’s fewer bytes to read off disk and fewer bytes through the CPU’s memory hierarchy, which is the actual bottleneck in OLAP.

Vectorised execution#

A column-store can apply a predicate to a whole block of values at once using SIMD instructions — modern CPUs process 8 or 16 int64 comparisons per cycle. Row-by-row execution touches one row’s columns at a time and gets none of this. The performance gap from vectorisation alone is often 10-100x, before any I/O or compression considerations.

Late materialisation#

SELECT user_id, SUM(amount) FROM orders WHERE status = 'paid' GROUP BY user_id on a column-store:

Scan status column; produce a bitmap of matching positions.
Use the bitmap to read only matching positions from user_id and amount.
Aggregate.

The id, created_at, and any other columns are never read. This late materialisation — only assemble rows when you absolutely must — is what makes column-stores fast on wide tables.

Variants#

Pure column-store (Vertica, MonetDB, DuckDB, ClickHouse, Snowflake) — one layout for the whole engine; OLTP point writes are slow.
PAX / hybrid pages (Parquet, ORC) — within a fixed-size page (the “row group” / “stripe”), data is laid out column-by-column, but the page boundary aligns row sets. Lets a single page deliver many rows of a few columns or a few rows of many columns.
Columnstore index (SQL Server, Postgres + Citus columnar) — secondary structure over a row-store base table. Adds the OLAP layout without giving up OLTP behavior.
Delta + base (Hudi, Iceberg, Delta Lake) — recent writes go into a small row-shaped delta; periodic compaction merges into columnar base files. Closes the OLTP-on-OLAP gap for append + occasional update.
In-memory column-store (Oracle In-Memory, SAP HANA) — column layout in RAM only; row-store on disk. Trades memory for query speed without touching the storage subsystem.

Trade-offs#

Row-store — fast point lookups, fast row-shaped writes, well-understood index structures (B-tree), low per-row overhead. Cold scans over wide tables read every column even when the query needs three. Compression ratios in the 2-3x range at best.

Column-store — scan a few columns out of many at full disk bandwidth, 5-20x compression, vectorised execution. Single-row writes are slow (every column file gets touched); deletes typically use tombstones; selecting SELECT * is the worst-case shape — defeats the whole point.

Operational consequences:

Storage. A column-store of the same logical data is often 3-10x smaller on disk after compression. Cloud-warehouse pricing is usually scan-bytes-based; column-stores save real money.
Concurrency. Row-stores handle 10K+ concurrent OLTP transactions on commodity hardware. Column-stores typically run a few dozen heavy analytical queries concurrently — the engine assumes each query is big.
Mutability. Row-stores update in place (or with MVCC). Column-stores rewrite or use delta tables; a row update touches every column file. ETL pipelines into a column-store usually batch.
Indexing. Row-stores live and die by B-tree indexes; column-stores often use only min-max statistics per block (zone maps), which is enough because scans are cheap.

Common pitfalls#

SELECT * on a column-store — defeats the layout. Every column file is read. Always project explicitly.
Wide tables on a row-store — a 200-column table where most queries use 5 columns wastes 97% of the buffer pool on data never touched. Either project narrower views or move to a column-store.
Row-by-row inserts into a column-store — a streaming write of one row per call into Redshift or BigQuery is a path of pain. Batch into 10K-1M rows per write.
Sorting matters for column-stores too — RLE, delta, and zone-map effectiveness all collapse if data isn’t sorted on the columns that drive predicates. Pick a sort key (ClickHouse ORDER BY, Snowflake clustering key, Redshift sort key) deliberately.
Treating a column-store as a transactional database — late-arriving updates against a delta-base table can lag for tens of seconds to minutes. If you need read-your-writes within milliseconds, you need a row-store or an HTAP design.
Mixed workload on a row-store buffer pool — one big analytical scan can evict the OLTP working set and tank latency for unrelated queries. Either isolate replicas or move the scan to a column-store mirror.

Why 1985 row-stores still dominate OLTP

The OLTP workload is shaped like the row-store: every read or write touches a small number of rows, all columns of those rows, and the next transaction will probably touch different rows. A row-store’s working set is the recently-touched rows. A column-store’s working set is the recently-touched columns — which under OLTP is “all of them, scattered across files”. The buffer pool’s hit rate, the WAL’s write amplification, and the lock manager’s row-level granularity all assume rows are the unit of work. Forty years later, no one has found a layout that beats row-store on this workload — the closest are in-memory engines that store rows in a more compact format, not engines that change the layout shape.