Sequencer — System Design · Engineering Playbook

Use cases#

A sequencer produces monotonically increasing, globally unique identifiers. Three pressing reasons distributed systems need one:

Primary keys at scale — auto-incrementing IDs require a coordinator; in a sharded system, every shard rolls its own and collides. A sequencer issues IDs that are unique without coordination.
Causal ordering — “did A happen before B?” answered without a single physical clock. Critical for replicated logs, conflict resolution, transaction ordering.
Time-ordered scans — IDs that sort by approximate creation time make timeline / feed / log queries efficient (range scans instead of secondary indexes).

The boundary case: snowflake IDs let you embed a timestamp in the ID, so SELECT * FROM tweets WHERE id > <yesterday> becomes a cheap range scan.

Functional requirements#

Produce unique 64-bit (or 128-bit) IDs at high throughput.
IDs are roughly monotonic — newer > older for typical comparisons.
Generation works during network partitions; no coordinator on the hot path.
Optionally encode useful metadata (timestamp, generator ID, type).

Non-functional requirements#

Latency: ID generation must be local — under 1 µs. Anything that requires a network round-trip is unusable in OLTP.
Throughput: 10k+ IDs/sec per generator; cluster-wide 100M+/sec.
Uniqueness: zero collisions, even under clock skew, restart, or worker reassignment.
Causality: if event A causally precedes B, id(A) < id(B) (or at least id(A) <= id(B) with HLC).

High-level design#

worker 1: [ 41-bit timestamp ms ][ 10-bit worker ID ][ 12-bit sequence ]
worker 2: [ 41-bit timestamp ms ][ 10-bit worker ID ][ 12-bit sequence ]
worker N: [ 41-bit timestamp ms ][ 10-bit worker ID ][ 12-bit sequence ]
                  ▲                     ▲                    ▲
                  │                     │                    │
              wall-clock           assigned at         per-ms counter
              (epoch-relative)     boot, unique         resets each ms

every worker generates IDs locally with no coordination

Each worker has a unique ID (assigned via ZooKeeper, Consul, or static config). The timestamp gives rough ordering; the worker ID guarantees uniqueness across workers; the per-ms sequence breaks ties within a single millisecond.

Detailed design#

Snowflake ID layout (Twitter’s original)#

sign | timestamp (41 bits) | worker (5) | datacenter (5) | sequence (12)
  1  |  ~69 years from epoch |   32 dc   |    32 worker  | 4096/ms/worker

41-bit timestamp is ms since a custom epoch. ~69 years before rollover.
5-bit datacenter + 5-bit worker allows 32×32 = 1024 generator instances.
12-bit sequence gives 4096 IDs per worker per millisecond — over 4M IDs/sec/worker.

When sequence overflows, busy-wait until the next millisecond.

class Snowflake {
    long lastMs = 0;
    int seq = 0;
    int workerId;

    long nextId() {
        long now = currentMs();
        if (now == lastMs) {
            seq = (seq + 1) & 0xFFF;
            if (seq == 0) now = tilNextMs(lastMs);
        } else {
            seq = 0;
        }
        lastMs = now;
        return (now << 22) | (workerId << 12) | seq;
    }
};

Clock skew is the central problem#

If a worker’s clock moves backwards (NTP step, VM pause, reboot), it might issue an ID smaller than one it already issued. Mitigations:

Refuse to issue IDs while now < lastMs (busy-wait, error after timeout).
Use monotonic clocks (CLOCK_MONOTONIC) for the elapsed component, with wall-clock only at startup.
Persist lastMs and refuse to start below it.

Hybrid Logical Clocks (HLC)#

When wall-clock isn’t trustworthy and Lamport clocks (pure counters) lose physical-time meaning, HLCs combine both:

HLC = (physical_ts, logical_counter)

on local event:
   pt = max(physical_now, hlc.pt)
   if pt == hlc.pt: hlc.lc += 1
   else: hlc.pt = pt; hlc.lc = 0

on receive(msg.hlc):
   pt = max(hlc.pt, msg.hlc.pt, physical_now)
   if pt == hlc.pt == msg.hlc.pt: hlc.lc = max(hlc.lc, msg.hlc.lc) + 1
   elif pt == hlc.pt: hlc.lc += 1
   elif pt == msg.hlc.pt: hlc.lc = msg.hlc.lc + 1
   else: hlc.lc = 0
   hlc.pt = pt

HLCs preserve causality (any happens-before edge in the system gives a strictly increasing HLC) while staying close to wall-clock time. CockroachDB and YugabyteDB use them internally for MVCC timestamps.

Spanner’s TrueTime#

Google solved the problem with hardware: every datacenter has GPS receivers and atomic clocks; the API returns a time interval [earliest, latest] rather than a point. Spanner waits out the uncertainty bound (~7 ms) before committing, guaranteeing externally consistent ordering across the entire planet. Cost: GPS + atomic clocks in every DC. Beautiful, but Google-only for now.

Range allocation (cheap alternative)#

A coordinator assigns 1000-ID blocks to each worker. Workers issue IDs locally from their block; when exhausted, request a new block. Pros: no clock dependency. Cons: gaps in the ID space after worker death, coordinator on the cold path.

MongoDB’s ObjectId and Flickr’s “ticket server” both use variants of this.

UUIDv7#

A newer standard (RFC 9562) — 128-bit ID with a 48-bit ms timestamp prefix and 74 random bits. Sorts by time, no coordinator needed, no clock-skew collision risk (random bits dwarf workers). Drop-in for snowflake when you don’t care about a worker-ID dimension.

Trade-offs#

Snowflake / time-prefixed — 64-bit, sorts by time, embeds metadata, requires careful worker-ID assignment and clock discipline. The dominant choice for new systems.

UUIDv4 (random) — 128-bit, no coordinator, no clock, never collides in practice. Costs 2× index space and destroys time-locality on disk (every insert hits a random B-tree page).

Other axes:

Coordinated (Flickr ticket server) vs uncoordinated (Snowflake) — coordinated is simpler to reason about but has a tight failure mode (coordinator down → no new IDs). Uncoordinated is more available but harder to debug.
Strict total order vs causal order — Lamport / HLC give causal; TrueTime gives external; Snowflake gives “approximately monotonic”. Pick by what your system actually needs.
64-bit vs 128-bit — 64-bit fits in registers, half the index space, sorts faster. 128-bit (UUID) is collision-proof under any clock skew, friendlier for systems that share IDs across organizational boundaries.

Real-world examples#

Twitter Snowflake — the canonical design (2010). Powers tweet IDs, DM IDs, and most Twitter primary keys.
Instagram — variant with (timestamp, shard_id, per_shard_sequence) — encodes which Postgres shard owns the row directly in the ID.
Discord — Snowflake IDs for messages, channels, users. Their time-bucketed Cassandra schema partitions on (channel_id, time_bucket_from_id).
Spanner — TrueTime-based commit timestamps; powers Google Ads, Gmail, Drive.
CockroachDB / YugabyteDB — HLC for MVCC; chosen so they can run on commodity hardware without atomic clocks.
MongoDB — ObjectId is (4-byte timestamp, 5-byte random, 3-byte counter) — 12 bytes, mostly snowflake-style.

Distributed Cache — sequencers cache the latest issued ID locally.
Databases — primary keys are the most common consumer.
Distributed Messaging Queue — uses sequencer IDs to dedupe and order messages.
Distributed Logging — log records carry sequencer IDs for global ordering across hosts.