PostgreSQL — The Reference Open-Source RDBMS — DBMS

What it is#

PostgreSQL is the open-source relational database that has, over thirty years, become the reference implementation of “boring, correct, extensible SQL”. It traces back to the POSTGRES research project at Berkeley (Stonebraker, 1986), gained SQL in 1996, and has shipped a major release roughly every year since. The current line (Postgres 16/17 era) handles tens of thousands of mixed transactions per second on a single node, scales reads via streaming replication, and is the default starting point for any new system that doesn’t have a strong reason to pick something else.

Three properties make Postgres the default choice. First, soundness — features ship when they’re correct, not when they’re fashionable; isolation levels mean what the manual says they mean. Second, extensibility — custom types, custom operators, custom index methods, foreign data wrappers, and procedural languages are first-class and stable. Third, the ecosystem — PostGIS, TimescaleDB, Citus, pgvector, pgbouncer, logical replication, every major cloud’s managed offering. Most modern data needs can be served by Postgres plus an extension.

Architecture overview#

Postgres uses a process-per-connection model. The postmaster (postgres parent process) listens on a TCP/UDS socket; each new client connection forks a dedicated backend process that handles that session’s queries from BEGIN to COMMIT. Backends share memory (the buffer pool, the WAL buffer, lock tables) via System V or POSIX shared memory; everything else is per-process.

                 ┌──────── postmaster ────────┐
clients ──TCP──> │   accept(), authenticate    │
                 │       fork backend          │
                 └──────────────┬──────────────┘
                                │
       ┌────────────────────────┼────────────────────────┐
       │                        │                        │
  ┌─backend─┐              ┌─backend─┐              ┌─backend─┐
  │ parser  │              │ parser  │              │ parser  │
  │ planner │              │ planner │              │ planner │
  │ exec    │              │ exec    │              │ exec    │
  └────┬────┘              └────┬────┘              └────┬────┘
       │                        │                        │
       └────── shared memory: buffer pool, WAL, locks ────┘
                                │
                                ▼
            ┌──────── auxiliary processes ────────┐
            │  WAL writer  · checkpointer          │
            │  autovacuum launcher + workers       │
            │  background writer · stats collector │
            │  walsender (per replica)             │
            └──────────────────────────────────────┘
                                │
                                ▼
                        data directory on disk
                        (PGDATA: base/, pg_wal/, pg_xact/, ...)

Auxiliary processes handle non-query work — flushing dirty buffers, writing the WAL, vacuuming dead tuples, gathering statistics, shipping WAL to replicas. Each is a separate OS process, which makes them easy to schedule and easy to monitor (ps, top) but means Postgres has more processes alive than a thread-per-connection database like MySQL.

The process-per-connection choice has one famous consequence: each backend costs ~10 MB of RSS plus its query-time work memory. A workload with thousands of concurrent connections wants pgbouncer or a similar connection pooler in front; without one, the OS process table becomes the bottleneck. This is the single most common operational footgun in Postgres deployments.

Storage and indexing#

The base storage unit is the heap — a collection of 8 KB pages per table, holding rows in insertion order with no inherent sort. Postgres does not maintain a clustered primary key; every index is a secondary index that points back into the heap via a 6-byte tuple identifier (TID = (block_number, offset_in_block)).

Default index type is the B-tree. Postgres also ships with hash, GiST, SP-GiST, GIN, and BRIN out of the box, each suited to a different access pattern:

Index type	Best for
B-tree	Equality and range on scalar columns (the default)
Hash	Equality-only, niche; rarely worth picking over B-tree
GiST	Geometric, range types, full-text, custom extensible
SP-GiST	Space-partitioned: tries, quad-trees
GIN	Inverted index — arrays, JSONB containment, full-text
BRIN	Block-range summaries — huge, naturally-clustered tables

A row is identified by its TID. Updates don’t overwrite in place: Postgres uses MVCC, so an UPDATE creates a new tuple version in the heap and marks the old one’s xmax to the updating transaction’s ID. The old version stays until VACUUM reclaims it. This makes reads concurrent with writes (snapshot isolation is free) but means a heavily-updated table accumulates bloat — dead tuples that take space and slow scans.

The TOAST (The Oversized Attribute Storage Technique) mechanism transparently moves values larger than ~2 KB into a sidecar table, optionally compressed. This is what lets Postgres store multi-megabyte JSONB documents and gigabyte text columns without breaking the 8 KB page model.

Query and transaction execution#

A query takes the following path through a backend:

Parser turns SQL text into a parse tree. Catches syntax errors here.
Rewriter applies rules — most importantly, expanding views into their underlying queries.
Planner / optimiser is cost-based. It enumerates join orders and access paths, costs each using table statistics gathered by autovacuum / ANALYZE, and picks the cheapest. The planner is one of the deeper parts of the codebase and the biggest reason to trust the system on hard queries.
Executor runs the chosen plan as a tree of physical operators (sequential scan, index scan, nested loop join, hash join, merge join, sort, materialize, …).

EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and reports actual row counts and timings per operator. Discrepancy between estimated and actual row counts is the first thing to look at when a query is slow — usually statistics are stale and ANALYZE will fix it.

Transactions are managed via MVCC. Each transaction gets a snapshot — a list of transaction IDs that were in progress when the snapshot was taken — and reads only tuples that were committed before the snapshot and not yet deleted by a transaction outside the snapshot. The default isolation level is READ COMMITTED; the snapshot is taken at the start of each statement. REPEATABLE READ takes one snapshot at the start of the transaction and uses it throughout — this is true snapshot isolation. SERIALIZABLE layers Serializable Snapshot Isolation (SSI) on top, detecting dependency cycles and aborting one transaction in the cycle.

The result: readers never block writers and writers never block readers. The only contention is between concurrent writers on the same row, which take row-level locks via xmax. This is the same general approach MySQL InnoDB and Oracle take; Postgres’s variant is the cleanest published implementation.

Replication and durability#

Durability rests on the write-ahead log (pg_wal). Every change to a data page first writes a WAL record describing the change; the record is flushed to disk on COMMIT (controlled by synchronous_commit). On crash, the recovery process replays the WAL forward from the last checkpoint, redoing committed transactions and discarding incomplete ones.

Replication ships the WAL to other nodes:

Streaming (physical) replication — the WAL stream is sent byte-for-byte to a standby, which applies it to an identical block layout. Replica is read-only. Failover promotes the standby (timeline switch).
Logical replication — the WAL is decoded into row-level change events (INSERT INTO t VALUES ...) and replayed on the subscriber. Lets the subscriber have a different schema, a subset of tables, or a different Postgres major version. Used heavily for upgrades and CDC.

Sync mode is per-commit-tunable:

synchronous_commit = off       — ack before fsync; risk of losing recent commits on crash
                   = local     — wait for local fsync (default)
                   = on        — wait for at least one sync standby's WAL receive
                   = remote_apply — wait for the standby to apply the WAL

Most production deployments run on with one sync standby and several async standbys, giving zero data loss on primary failure plus extra read capacity.

Streaming physical replication — byte-identical replicas, simple, fast (microsecond lag is achievable on a healthy network). Replica must run identical Postgres version + extensions; whole cluster, not a subset.

Logical replication — row-level events, replica can be a different version / schema / subset. Higher CPU cost on the publisher (decoding the WAL), no DDL replication by default, and large transactions can lag. Used for upgrades, CDC pipelines, partial replication.

Backups are taken via pg_basebackup (full physical) plus continuous WAL archiving, allowing point-in-time recovery (PITR) to any second-resolution timestamp. Tooling layered on top (pgBackRest, WAL-G, Barman) handles retention, encryption, and S3 offload.

Operational characteristics#

A single Postgres instance is comfortable in the following envelope:

CPU: scales linearly with vCPUs up to ~64 cores; beyond that, contention on shared memory structures (lock manager, buffer pool partitions) limits the curve. Aurora Postgres and similar cloud variants restructure parts of this internally.
Memory: shared_buffers typically 25% of system RAM; work_mem is per-operator-per-query so multiply by expected concurrency.
Connections: 100-300 active backends per instance is the practical ceiling without a pooler. With pgbouncer transaction-pooling, a single instance can comfortably serve 5–10K client connections behind a few hundred backends.
Throughput: 10-50 K mixed TPS on commodity cloud hardware; 100 K+ on bare metal with NVMe. The TPC-C benchmarks on tuned Postgres routinely beat the marketing numbers of much newer databases.
Disk: WAL and data on separate volumes is standard; the WAL is the durability hot path and benefits from low-latency storage.

The community has stabilised on a recognisable operations toolkit: pg_stat_statements for query-level performance, pg_stat_activity for live session inspection, pg_locks for contention, pgbadger for log analysis, and a thick layer of cloud-vendor tools on top of those.

Extensions are the killer feature. The same binary runs PostGIS (geospatial), TimescaleDB (time-series), pgvector (vector similarity), Citus (sharding), pgcrypto, pg_partman, pg_repack, and dozens more. An extension is a CREATE EXTENSION foo; away and most are written in C with first-class hooks into the planner, executor, and storage layer.

Trade-offs and gotchas#

The places Postgres reliably surprises engineers coming from other systems:

Process model and connection cost. Already noted; this is the single most common production issue. Always front Postgres with a pooler past a few hundred concurrent users.
MVCC bloat. Heavy update workloads (UPDATE on a hot row repeatedly) generate dead tuples faster than autovacuum can clean them. The HOT (Heap-Only Tuple) optimisation helps when the updated columns aren’t indexed, but a table with many indexes and many updates is a classic bloat trap.
Transaction ID wraparound. Postgres uses 32-bit transaction IDs and recycles them. Autovacuum must “freeze” old tuples before the ID space wraps. A misconfigured cluster can hit wraparound and enter forced read-only mode — a real outage that has happened to large deployments.
No clustered primary key. Coming from InnoDB, expect every index lookup to cost an extra heap fetch. Index-only scans help when the index covers the query (CREATE INDEX … INCLUDE).
Materialised view refresh. REFRESH MATERIALIZED VIEW is a full rebuild by default. The CONCURRENTLY variant is non-blocking but requires a unique index on the view.
pg_dump is not a backup tool for large databases. It’s a logical export; restore takes hours for terabyte clusters. Use physical backup plus WAL archiving for anything past ~100 GB.
DDL takes ACCESS EXCLUSIVE locks — adding a column with a default in older versions blocked all readers and writers for the duration of the table rewrite. Postgres 11+ added fast non-rewriting ALTER TABLE ... ADD COLUMN ... DEFAULT for constant defaults, but every team learns the older behaviour the hard way at least once.
JSONB is great until it isn’t. Indexing arbitrary JSONB paths via GIN works but the index can dwarf the data; updates to JSONB columns rewrite the whole document. Use JSONB for semi-structured data that’s mostly read whole; use real columns for hot fields.

Why Postgres has won over the last decade

Three pressures converged. The cloud made operational maturity matter more than benchmark wins — every managed Postgres offering is one click away. The NoSQL retreat (most “we need NoSQL” use cases turned out to be “we need a JSON column”) played to Postgres’s strengths: JSONB, generated columns, and pgvector all let one engine serve workloads that previously needed three. And the extension ecosystem turned Postgres into a platform rather than a single-purpose database: PostGIS for maps, TimescaleDB for telemetry, Citus for sharding, pgvector for embeddings — each previously a separate system, now CREATE EXTENSION away.

MySQL / InnoDB — the other major open-source RDBMS, with a different storage model (clustered primary key) and a thread-per-connection process model.
MVCC — Multi-Version Concurrency Control — the concurrency mechanism Postgres uses end-to-end.
Write-Ahead Logging and Recovery — the durability substrate underneath replication and PITR.
B-Tree Indexes — Postgres’s default index type and the workhorse for OLTP queries.
Transactions and ACID — the contract Postgres implements per the letter of the spec.