Sequencer
Globally-ordered unique IDs with causality. Snowflake, TrueTime, hybrid logical clocks.
Use cases#
A sequencer produces monotonically increasing, globally unique identifiers. Three pressing reasons distributed systems need one:
- Primary keys at scale — auto-incrementing IDs require a coordinator; in a sharded system, every shard rolls its own and collides. A sequencer issues IDs that are unique without coordination.
- Causal ordering — “did A happen before B?” answered without a single physical clock. Critical for replicated logs, conflict resolution, transaction ordering.
- Time-ordered scans — IDs that sort by approximate creation time make timeline / feed / log queries efficient (range scans instead of secondary indexes).
The boundary case: snowflake IDs let you embed a timestamp in the ID, so SELECT * FROM tweets WHERE id > <yesterday> becomes a cheap range scan.
Functional requirements#
- Produce unique 64-bit (or 128-bit) IDs at high throughput.
- IDs are roughly monotonic — newer > older for typical comparisons.
- Generation works during network partitions; no coordinator on the hot path.
- Optionally encode useful metadata (timestamp, generator ID, type).
Non-functional requirements#
- Latency: ID generation must be local — under 1 µs. Anything that requires a network round-trip is unusable in OLTP.
- Throughput: 10k+ IDs/sec per generator; cluster-wide 100M+/sec.
- Uniqueness: zero collisions, even under clock skew, restart, or worker reassignment.
- Causality: if event A causally precedes B,
id(A) < id(B)(or at leastid(A) <= id(B)with HLC).
High-level design#
worker 1: [ 41-bit timestamp ms ][ 10-bit worker ID ][ 12-bit sequence ]worker 2: [ 41-bit timestamp ms ][ 10-bit worker ID ][ 12-bit sequence ]worker N: [ 41-bit timestamp ms ][ 10-bit worker ID ][ 12-bit sequence ] ▲ ▲ ▲ │ │ │ wall-clock assigned at per-ms counter (epoch-relative) boot, unique resets each ms
every worker generates IDs locally with no coordinationEach worker has a unique ID (assigned via ZooKeeper, Consul, or static config). The timestamp gives rough ordering; the worker ID guarantees uniqueness across workers; the per-ms sequence breaks ties within a single millisecond.
Detailed design#
Snowflake ID layout (Twitter’s original)#
sign | timestamp (41 bits) | worker (5) | datacenter (5) | sequence (12) 1 | ~69 years from epoch | 32 dc | 32 worker | 4096/ms/worker- 41-bit timestamp is ms since a custom epoch. ~69 years before rollover.
- 5-bit datacenter + 5-bit worker allows 32×32 = 1024 generator instances.
- 12-bit sequence gives 4096 IDs per worker per millisecond — over 4M IDs/sec/worker.
When sequence overflows, busy-wait until the next millisecond.
class Snowflake { long lastMs = 0; int seq = 0; int workerId;
long nextId() { long now = currentMs(); if (now == lastMs) { seq = (seq + 1) & 0xFFF; if (seq == 0) now = tilNextMs(lastMs); } else { seq = 0; } lastMs = now; return (now << 22) | (workerId << 12) | seq; }};Clock skew is the central problem#
If a worker’s clock moves backwards (NTP step, VM pause, reboot), it might issue an ID smaller than one it already issued. Mitigations:
- Refuse to issue IDs while
now < lastMs(busy-wait, error after timeout). - Use monotonic clocks (
CLOCK_MONOTONIC) for the elapsed component, with wall-clock only at startup. - Persist
lastMsand refuse to start below it.
Hybrid Logical Clocks (HLC)#
When wall-clock isn’t trustworthy and Lamport clocks (pure counters) lose physical-time meaning, HLCs combine both:
HLC = (physical_ts, logical_counter)
on local event: pt = max(physical_now, hlc.pt) if pt == hlc.pt: hlc.lc += 1 else: hlc.pt = pt; hlc.lc = 0
on receive(msg.hlc): pt = max(hlc.pt, msg.hlc.pt, physical_now) if pt == hlc.pt == msg.hlc.pt: hlc.lc = max(hlc.lc, msg.hlc.lc) + 1 elif pt == hlc.pt: hlc.lc += 1 elif pt == msg.hlc.pt: hlc.lc = msg.hlc.lc + 1 else: hlc.lc = 0 hlc.pt = ptHLCs preserve causality (any happens-before edge in the system gives a strictly increasing HLC) while staying close to wall-clock time. CockroachDB and YugabyteDB use them internally for MVCC timestamps.
Spanner’s TrueTime#
Google solved the problem with hardware: every datacenter has GPS receivers and atomic clocks; the API returns a time interval [earliest, latest] rather than a point. Spanner waits out the uncertainty bound (~7 ms) before committing, guaranteeing externally consistent ordering across the entire planet. Cost: GPS + atomic clocks in every DC. Beautiful, but Google-only for now.
Range allocation (cheap alternative)#
A coordinator assigns 1000-ID blocks to each worker. Workers issue IDs locally from their block; when exhausted, request a new block. Pros: no clock dependency. Cons: gaps in the ID space after worker death, coordinator on the cold path.
MongoDB’s ObjectId and Flickr’s “ticket server” both use variants of this.
UUIDv7#
A newer standard (RFC 9562) — 128-bit ID with a 48-bit ms timestamp prefix and 74 random bits. Sorts by time, no coordinator needed, no clock-skew collision risk (random bits dwarf workers). Drop-in for snowflake when you don’t care about a worker-ID dimension.
Trade-offs#
Other axes:
- Coordinated (Flickr ticket server) vs uncoordinated (Snowflake) — coordinated is simpler to reason about but has a tight failure mode (coordinator down → no new IDs). Uncoordinated is more available but harder to debug.
- Strict total order vs causal order — Lamport / HLC give causal; TrueTime gives external; Snowflake gives “approximately monotonic”. Pick by what your system actually needs.
- 64-bit vs 128-bit — 64-bit fits in registers, half the index space, sorts faster. 128-bit (UUID) is collision-proof under any clock skew, friendlier for systems that share IDs across organizational boundaries.
Real-world examples#
- Twitter Snowflake — the canonical design (2010). Powers tweet IDs, DM IDs, and most Twitter primary keys.
- Instagram — variant with
(timestamp, shard_id, per_shard_sequence)— encodes which Postgres shard owns the row directly in the ID. - Discord — Snowflake IDs for messages, channels, users. Their time-bucketed Cassandra schema partitions on
(channel_id, time_bucket_from_id). - Spanner — TrueTime-based commit timestamps; powers Google Ads, Gmail, Drive.
- CockroachDB / YugabyteDB — HLC for MVCC; chosen so they can run on commodity hardware without atomic clocks.
- MongoDB —
ObjectIdis(4-byte timestamp, 5-byte random, 3-byte counter)— 12 bytes, mostly snowflake-style.
Related building blocks#
- Distributed Cache — sequencers cache the latest issued ID locally.
- Databases — primary keys are the most common consumer.
- Distributed Messaging Queue — uses sequencer IDs to dedupe and order messages.
- Distributed Logging — log records carry sequencer IDs for global ordering across hosts.