Google Spanner — DBMS · Engineering Playbook

What it is#

Google Spanner is the planet-scale relational database Google built between 2007 and 2012 and published in the OSDI 2012 paper (Corbett et al.) that reshaped distributed-database research. It is the storage layer for Google’s most demanding internal systems — AdWords, Google Photos metadata, Gmail account metadata, Google Cloud’s customer billing — and is available externally as Cloud Spanner since 2017. The breakthrough idea was TrueTime: a globally synchronised clock with bounded uncertainty, exposed as an API that returns [earliest, latest] rather than a single timestamp, letting Spanner serialize transactions across continents while preserving external consistency.

Before Spanner, distributed databases offered either strong consistency in one region (Postgres + sync replicas) or eventual consistency across regions (Dynamo, Cassandra). Spanner offered externally consistent transactions across continents — meaning that if transaction T1 finishes before T2 starts (in real time), every observer sees T1’s effects before T2’s. No prior production system had achieved that property at global scale; many had argued it was impossible without prohibitive latency. TrueTime made it tractable.

Architecture overview#

Spanner organises data into zones (think availability zones), spanservers (storage nodes), and directories (the unit of geographic placement). A spanserver hosts hundreds of tablets — contiguous ranges of the keyspace. Each tablet is replicated across multiple zones via Paxos, forming a Paxos group. A transaction touches one or more groups; one group is the leader for the transaction; cross-group transactions use two-phase commit on top of Paxos.

                   client
                      │
                      ▼
              ┌─── universe master ───┐
              │ placement decisions  │
              │ schema management    │
              └──────────┬───────────┘
                         │
        ┌────────────────┼────────────────┐
        │                │                │
   ┌─ zone A ─┐     ┌─ zone B ─┐     ┌─ zone C ─┐
   │ spanserv │     │ spanserv │     │ spanserv │
   │ ┌─tablet─┐│     │ ┌─tablet─┐│     │ ┌─tablet─┐│
   │ │ Paxos  ││  ←→ │ │ Paxos  ││  ←→ │ │ Paxos  ││
   │ │ replica││     │ │ replica││     │ │ leader ││
   │ └────────┘│     │ └────────┘│     │ └────────┘│
   └──────────┘     └──────────┘     └──────────┘
                                              │
                                              ▼
                                       Colossus (storage)
                                       TrueTime (clocks)

Each zone runs:

A zonemaster assigning tablets to spanservers.
A location proxy routing client requests to the spanserver holding the relevant tablet.
A few spanservers hosting the actual data.

The placement driver moves directories between zones in response to access patterns, capacity, and explicit policy (“keep this customer’s data in EU”). Movement is online; transactions in flight see a brief blip but no failure.

The underlying storage is Colossus (Google’s GFS successor), a distributed file system providing replicated, durable file storage. Spanner writes its log and SSTables to Colossus, which handles disk-level durability and replication within a zone.

Storage and indexing#

The on-disk format is B-tree-like sorted SSTables with a WAL — close to the Bigtable lineage but layered with relational schema, joins, and secondary indexes. Each tablet holds a contiguous range of (key, timestamp) → value pairs. The timestamp is the MVCC version; older versions are kept for the configured retention (default 1 hour, configurable up to weeks for time-travel queries).

Schema is fully relational: tables, columns, primary keys, foreign keys, secondary indexes, interleaved hierarchies. The latter is a Spanner-specific optimisation: declaring Orders interleaved in Users places each order in the same Paxos group as its parent user, making SELECT * FROM Users JOIN Orders on one user a single-group query that avoids two-phase commit.

Indexes are themselves tables, automatically maintained. A secondary index IDX_orders_by_status is a Spanner table partitioned by status, transactionally updated whenever the base table changes. Cross-table consistency comes for free because the index update is part of the same transaction.

Query language is GoogleSQL (a strict superset of ANSI SQL with Google-specific extensions) — joins, subqueries, window functions, CTEs, all standard. The Cloud Spanner PostgreSQL dialect (2022+) added Postgres syntax compatibility for migrations.

Query and transaction execution#

Spanner runs three transaction kinds:

Read-write transactions — full ACID across any keys; use 2PC across Paxos groups. Up to ~10s of write throughput per group; multi-group transactions are slower (the 2PC tax).
Read-only transactions — snapshot reads at a chosen (possibly historical) timestamp; do not block writers; do not block each other. The common path for most reads.
Snapshot reads — single-key or short-range reads at an explicit past timestamp. Used for backfills, debugging, MVCC time-travel.

A read-write transaction:

Client picks a coordinator (one of the Paxos group leaders involved).
Client sends reads to each group; each group leader returns the current snapshot version.
Client buffers writes locally; on commit, sends them to the coordinator.
Coordinator picks a commit timestamp s using TrueTime: s = TT.now().latest plus margin for safety.
Coordinator runs 2PC across all involved groups; each group applies the writes at timestamp s via its Paxos log.
Coordinator waits out the clock uncertainty — sleeps until TT.now().earliest > s — so the commit is guaranteed to be in the past from any future observer’s perspective.
Reply to client.

The commit wait is the central novelty. It’s typically 6-7 ms (TrueTime’s uncertainty window) plus the 2PC round-trip — meaning a multi-region transaction commit is in the tens to hundreds of ms depending on the geography.

Replication and durability#

Each tablet’s writes are committed via Paxos across its replicas. A typical 3-replica or 5-replica configuration places replicas in different zones (and often different regions). A write commits when a Paxos majority acks the log entry.

Replication topologies:

Regional — 3 replicas in 3 zones of one region. Lowest latency; survives a zone failure.
Multi-region — 5 replicas spread across regions, e.g., 2 in us-east1, 2 in us-west1, 1 witness in a third region. Writes still commit on a quorum; reads can be served from a local replica with bounded staleness (eventual) or always from the leader (strong).
Continent-spanning — 5+ replicas across continents; writes commit when a global majority acks. Commit latency dominated by the cross-continent RTT.

The write quorum is the key tunable. More replicas = better durability and locality but higher commit latency.

TrueTime is the second pillar of durability. Google deploys atomic clocks and GPS receivers in every datacenter; a time master in each zone polls them; spanservers poll the time masters. The result is an interval of width ~7 ms (the bound on clock skew between any two nodes). Every commit waits out this interval; every read at timestamp s reads from any replica that has acknowledged Paxos log entries up to s.

The public Cloud Spanner equivalent uses Google’s internal clock infrastructure, exposed via the same TrueTime API as internal Spanner.

Operational characteristics#

Cloud Spanner exposes capacity in nodes (or processing units; 1000 PU = 1 node). A node provides:

~10,000 QPS of single-key reads at single-digit-ms latency.
~1,800 QPS of single-key writes.
2 TB of storage per node (instance-level; data shards across all nodes).
99.999% availability at multi-region scale; 99.99% regional.

The envelope:

Cost — Spanner is expensive on the dollars-per-QPS axis vs DynamoDB or Postgres. The economic justification is the SQL surface plus global consistency.
Latency — regional p99 in the 5-15 ms range for single-row reads; multi-region p99 commits in the 50-200 ms range depending on geography. Commit wait adds 6-7 ms minimum to every write.
Throughput scaling — linear with node count, as long as the schema avoids hot rows. Hot-row contention has the same shape as in any system: one Paxos group serialising too much work.
Schema changes — online, including index creation. A schema change is a background operation that doesn’t block traffic; large index builds take hours but don’t interrupt writes.
Backups — automatic point-in-time backups; restore creates a new database; cross-region copy supported.

Trade-offs and gotchas#

Operating Spanner well requires understanding:

Hot keys still exist. A partition key with skewed traffic concentrates load on one Paxos group; the group’s leader becomes the bottleneck. Same write-sharding strategies apply as in DynamoDB.
Interleaved tables vs joins. Interleaving Orders in Users keeps a customer’s data in one group, making most queries local — but limits how that data can be queried across customers. Reaching across groups for a global query incurs the 2PC tax.
Commit wait is non-negotiable. Every multi-region write pays ~7 ms minimum. Workloads built on the assumption of sub-millisecond writes don’t fit Spanner.
Long transactions hurt. Each transaction holds locks on every group it touches until commit. A 10-second transaction is a 10-second lock on potentially many tablets.
Stale reads are an option, not a default. MaxStaleness(N seconds) reads are routed to local replicas without contacting the leader, dramatically reducing latency. Many read-heavy workloads use 10-15 second staleness with no application impact.
Schema needs forethought. Choice of primary key, interleaving, and indexes is hard to change later on a multi-TB table. The relational surface tempts engineers to “design as you go”; Spanner punishes that more than Postgres does.
Vendor lock-in is total. Self-hosted Spanner doesn’t exist; CockroachDB and YugabyteDB are conceptual rivals but neither implements TrueTime. Migration off Cloud Spanner is a multi-quarter project.
Cost mistakes are common. Forgetting to size down a dev instance, leaving unused indexes, not consolidating low-throughput tables — Spanner’s per-node pricing punishes these. Engineering review of node usage is part of routine ops.

Spanner — global external consistency, SQL surface, transactional joins, automatic placement, online schema changes. Per-node cost is high; commit wait adds 6-7 ms minimum; Google Cloud only; learning curve for the schema and placement model.

Postgres + cross-region read replicas — cheaper, no vendor lock-in, familiar tooling, identical SQL on every cloud. Strong consistency only in the primary region; replication lag on the read side; failover is operationally complex; no globally consistent writes.

Why TrueTime was the breakthrough that nobody else copied

Most distributed databases work hard to be correct without trusting clocks — Lamport timestamps, vector clocks, Raft logs ordering events. The general view in research and industry pre-2012 was that physical clocks couldn’t be relied on. Spanner inverted this: instead of avoiding clocks, treat them as a first-class API with explicit uncertainty. The hardware investment is real — atomic clocks and GPS receivers in every datacenter, a time-server fleet, every spanserver running a TT.now() syscall. Open-source alternatives (CockroachDB, YugabyteDB) use Hybrid Logical Clocks instead — they get most of Spanner’s properties but pay a different cost: HLC can give incorrect timestamps if a node’s clock drifts past the bound, and the system must detect and correct or reject those cases. Spanner’s clock infrastructure can be built; it just takes Google’s scale to amortize the cost. Cloud Spanner is the productisation of that investment.

Amazon DynamoDB — the AWS counterpart for hyperscale storage; different consistency story.
MVCC — Multi-Version Concurrency Control — Spanner’s concurrency model, with timestamps as the version axis.
Transactions and ACID — the contract; Spanner extends it to external consistency across continents.
Write-Ahead Logging and Recovery — Paxos logs are Spanner’s WAL; recovery is replay from the log.
PostgreSQL — The Reference Open-Source RDBMS — Cloud Spanner offers a Postgres dialect; the operational tradeoffs are very different.