MySQL / InnoDB
The B+ tree clustered-index storage engine, group replication, the historical mistakes and how 8.x undid them.
What it is#
MySQL is the other major open-source relational database — alongside PostgreSQL — and InnoDB is its default storage engine since version 5.5 (2010). The combination powers a sizeable fraction of the public internet: Wikipedia, Facebook’s social graph (in heavily-modified forks), GitHub, Shopify, Booking.com, Uber’s early monolith, and the typical LAMP-stack startup of the last twenty years. Owned by Oracle since the Sun acquisition in 2010, MySQL ships as Community Edition (GPL) and Enterprise Edition; MariaDB and Percona Server are notable open-source forks.
Where Postgres optimises for soundness and extensibility, MySQL/InnoDB optimises for throughput on a narrower OLTP shape, simpler operations, and aggressive replication. The architecture choices show: clustered primary key, thread-per-connection, integrated query optimiser tuned for the join shapes typical of web apps, and a replication ecosystem (statement, row, mixed, group replication, XtraDB Cluster) that’s been production-hardened by the LAMP era’s largest deployments.
Architecture overview#
MySQL uses a thread-per-connection model. A single mysqld process handles all clients; each connection is one OS thread inside that process. Memory is shared by default (the buffer pool, redo log buffer, query cache before it was removed in 8.0), which makes high-connection counts cheaper than Postgres’s process-per-connection but harder to isolate.
┌─────────────── mysqld ───────────────┐ │ connection threads │ │ parser → optimiser → executor │ │ │ │ ┌─────── pluggable storage ───────┐ │ │ │ InnoDB · MyISAM · CSV · │ │ │ │ Memory · Archive · NDB │ │ │ │ (InnoDB is the default) │ │ │ └──────────────┬───────────────────┘ │ │ │ │ │ buffer pool · redo log · undo log │ │ adaptive hash · change buffer │ └────────────────┬───────────────────────┘ ▼ data files + ib_logfile + binlogThe defining feature is the pluggable storage engine API. The same SQL layer can sit on top of multiple storage engines per table; in practice everything serious runs on InnoDB and the other engines are historical or niche (NDB for in-memory cluster, Archive for write-once tables). The pluggable design dates to MySQL’s earlier days when MyISAM was the default; it survives but adds a layer of indirection InnoDB-only engines avoid.
Binlog is the SQL-layer replication log, separate from InnoDB’s redo log. Every committed transaction writes to both — and coordinating the two via XA-style two-phase commit (sync_binlog=1, innodb_flush_log_at_trx_commit=1) is what makes commits durable across the binlog/InnoDB pair. The dual-log design is occasionally cited as the biggest architectural debt in MySQL; it’s also what enables MySQL’s diverse replication topologies.
Storage and indexing#
InnoDB is a B+ tree storage engine with one defining choice: the primary key is a clustered index. The table’s actual row data lives in the leaf pages of the primary-key B+ tree, sorted by primary-key value. Looking up a row by primary key is one B+ tree walk; the leaf page is the row.
Implications cascade from this:
- Secondary indexes point at the primary key, not at a physical row location. A secondary-index lookup walks the secondary tree to get the PK, then walks the primary tree to fetch the row — two B+ tree walks per secondary-key access. Postgres, by contrast, points indexes directly at heap TIDs (one tree walk plus a heap fetch).
- Primary-key choice matters for write performance. An auto-incrementing
BIGINTPK appends to the end of the clustered index — sequential writes, minimal page splits. A UUIDv4 PK scatters writes across the entire tree — random writes, constant page splits, dramatically worse write throughput. UUIDv7 (time-ordered) was retrofitted partly because of this. - No bloat from updates (in the InnoDB sense). MVCC versions live in the undo log, a separate segment. The clustered index always holds the latest version; old versions are reconstructed from undo when a snapshot needs them. This is the structural reason InnoDB doesn’t have Postgres’s autovacuum problem — but it has its own purge-thread analogue that lags under heavy updates.
Page size is 16 KB by default (configurable at instance creation only). The buffer pool — InnoDB’s page cache — is typically sized at 50-75% of system RAM; the famous tuning advice “set innodb_buffer_pool_size and walk away” is the single biggest performance lever in MySQL.
The redo log (ib_logfile0, ib_logfile1) is InnoDB’s WAL. Every page change writes a redo record; on crash, recovery replays from the last checkpoint. The redo log is fixed-size, circular, and sized to hold ~1 hour of writes — too small and checkpoints flush constantly, too large and recovery takes longer.
The change buffer is an InnoDB-specific optimisation: random writes to non-unique secondary indexes are buffered in memory and merged into the index when the affected pages are read or evicted. Drastically reduces random I/O on insert-heavy tables with many secondary indexes.
Query and transaction execution#
A query takes the path:
- Parser — SQL to parse tree.
- Optimiser — cost-based; historically less sophisticated than Postgres’s, but 8.0 added a hash join, lateral derived tables, and a real query rewrite framework. Statistics live in
mysql.innodb_table_statsandmysql.innodb_index_stats, updated automatically on data change beyond a threshold or manually viaANALYZE TABLE. - Executor — pulls rows via the storage-engine API. Most operators are familiar: ref / range / index / table scans, nested-loop joins (the historic default), hash joins (8.0+), sort, group-by.
EXPLAIN and EXPLAIN ANALYZE (8.0+) report plans. The MySQL plan output is notoriously terse — one row per access path — compared to Postgres’s verbose tree.
Transactions use MVCC via the undo log. Default isolation is REPEATABLE READ — different from Postgres’s READ COMMITTED. InnoDB’s REPEATABLE READ is, in practice, snapshot isolation with one important extension: gap locks prevent phantom inserts within a range a transaction has read. This makes InnoDB’s REPEATABLE READ stronger than the ANSI spec requires, at the cost of some unfortunate locking under high concurrency.
Isolation levels:
| Level | Behavior |
|---|---|
| READ UNCOMMITTED | Dirty reads allowed; rarely used |
| READ COMMITTED | Statement-level snapshots; no gap locks |
| REPEATABLE READ | Transaction-level snapshot + gap locks (default) |
| SERIALIZABLE | All reads as SELECT FOR SHARE |
Locking is row-level via the clustered index. Non-indexed UPDATEs lock every row scanned (no predicate index = no row-level lock granularity), which is the most common production locking surprise.
Replication and durability#
MySQL’s replication is the part that made it the social-graph default of the 2000s. The mechanism is binlog replication: every transaction is logged to the binary log on the primary; replicas pull the binlog and apply it.
Three binlog formats:
- Statement-based replication (SBR) — log the SQL statement itself. Compact but unsafe for non-deterministic functions (
NOW(),RAND(),UUID()). - Row-based replication (RBR) — log the before/after image of each changed row. Deterministic; the modern default.
- Mixed — statement when safe, row when not. The pragmatic compromise.
Two replication topologies:
- Asynchronous (default) — the primary acks the commit immediately; replicas catch up at their own pace. Fast; data loss possible on primary failure if a replica hasn’t received the last binlog events.
- Semi-synchronous — the primary waits for at least one replica to acknowledge receipt (not apply) before acking the commit. Better durability; one network round-trip added to commit latency.
MySQL Group Replication (5.7+) and the related InnoDB Cluster add Paxos-like quorum-based replication: writes go to a primary, are sent to all members, committed only if a majority acknowledges. Automatic failover, no manual replica promotion. Group Replication is to MySQL what Patroni-managed streaming replication is to Postgres — the supported HA story.
GTIDs (Global Transaction Identifiers) tag every transaction across the cluster, simplifying failover by making “where were we?” a first-class question instead of an offset-tracking exercise. Pre-GTID MySQL replication failover was infamously fiddly.
Operational characteristics#
A single InnoDB instance comfortably runs in this envelope:
- CPU: scales to 32-64 cores; beyond that, lock-manager and adaptive-hash contention bite. The thread-per-connection model means high connection counts (10K+) work without a pooler, though ProxySQL or
mysql-routerare common at scale. - Memory: buffer pool 50-75% of system RAM; the buffer pool is the performance lever. 256 GB buffer pools on bare metal are routine.
- Throughput: 50-100 K QPS on commodity cloud hardware; the original MySQL appeal — “good enough at small, great at large with the right tuning” — still holds.
- Connections: 10K idle connections is fine; active concurrent transactions are limited by lock-manager scaling — practical ceiling around 1-2K active.
Operational toolkit: SHOW ENGINE INNODB STATUS for the lock and buffer-pool state, performance_schema for query-level data, sys.* views for the human-readable summaries, Percona Toolkit (pt-query-digest, pt-online-schema-change) for production work, MySQL Shell for cluster operations.
Online schema change is where MySQL has historically lagged Postgres. ALTER TABLE traditionally rebuilt the entire table; tools like pt-online-schema-change and gh-ost (GitHub’s open-source tool) emerged to do online ALTERs by creating a shadow table, double-writing, and swapping. 8.0 added instant ADD COLUMN and other non-rewriting ALTERs, narrowing but not closing the gap.
Trade-offs and gotchas#
The places InnoDB reliably surprises:
- Clustered index means PK choice is irreversible. Picking a poor PK (UUIDv4, hash, anything non-sequential) at table creation locks in slow writes forever. Recreating the table to change the PK on a billion-row table is a project, not a maintenance window.
- Secondary index width = primary key width. Every secondary index entry includes the PK as the row pointer. A wide PK inflates every secondary index.
- Gap locks under REPEATABLE READ. Two transactions selecting overlapping ranges can deadlock on gap locks that no row actually occupies. Either drop to READ COMMITTED (most teams eventually do for high-concurrency tables) or design schemas to avoid range scans on hot tables.
innodb_buffer_pool_sizeis set at startup. Resizing online is technically supported in 5.7+ but pauses the engine. Plan capacity.- The binlog can become the bottleneck. A heavy-write primary fsync-ing the binlog and the redo log per commit is doing two fsyncs per transaction; on slow storage this caps throughput. Group commit (
binlog_group_commit_sync_delay) is the standard mitigation. - MyISAM legacy. Tables on MyISAM have no transactions, no row-level locks, no crash safety, and no foreign keys. Every now and then a legacy table or system table is still on MyISAM; migrating is its own project.
- Default character set drama. Pre-8.0 default was
latin1; pre-5.7utf8meant 3-byte UTF-8 only (no astral characters). 8.0 default isutf8mb4. Older databases carry the scars; migrations are subtle. - XA transactions. Useful when binlog + InnoDB redo + replicas all need to agree, but the heuristic-decision recovery story is awkward and the XA prepared-transaction state can leak under failure.
The 8.0 release that quietly fixed twenty years of pain
MySQL 8.0 (2018) is the release that closed many of the historical gaps. Atomic DDL (CREATE / DROP / ALTER are now transactional, eliminating “half-applied schema changes after crash”). A real data dictionary instead of .frm files. Window functions and CTEs (more than a decade after Postgres). Hash joins. Invisible indexes (test removal without dropping). Persisted statistics. Instant ADD COLUMN. Default character set utf8mb4. The deprecation of the query cache, which had become a multi-core scaling bottleneck. The post-8.0 MySQL is a genuinely different system from the 5.x line that earned its reputation for foot-guns; teams running 5.7 in 2026 are running yesterday’s MySQL.
Related systems#
- PostgreSQL — The Reference Open-Source RDBMS — the other major open-source RDBMS, with a heap-based storage layout and a process-per-connection model.
- B-Tree Indexes — InnoDB’s everything-is-a-B+-tree storage layout.
- MVCC — Multi-Version Concurrency Control — InnoDB’s concurrency model, implemented via undo segments.
- Write-Ahead Logging and Recovery — the redo log and binlog together provide InnoDB’s durability and replication.
- Transactions and ACID — the contract InnoDB implements; the dual-log design is the price of binlog-based replication.