Google Docs (Collaborative Editing) — System Design

Step 1 — Clarify Requirements#

Functional

Multiple users can edit the same document simultaneously; all see each other’s edits within ~1 second.
Edits made while offline reconcile cleanly when the user reconnects.
Document state converges — all collaborators end up seeing the same content.
Presence: who else is in the doc, where their cursor is.
Version history: snapshots and the ability to view/revert past versions.
Out of scope: rich media embeds, comments threads (covered by a separate service), permissions UI.

Non-functional

99.99% availability.
p99 keystroke-to-collaborator-sees latency under 200 ms.
Convergence: all clients arrive at the same final document state given the same set of edits.
Document size: most docs under 1 MB; long-tail up to tens of MB.

Step 2 — Capacity Estimation#

Active documents with concurrent editing: ~1 M at any moment globally.
Average concurrent editors per active doc: 2-5; long-tail up to ~100 in a few large docs.
Edits per second per doc: ~5 keystrokes/sec/user × 3 users = ~15 edits/sec per active doc.
Total edits/sec: 1 M docs × 15 edits = ~15 M edits/sec at peak. Most are small (~50 B of operation metadata).
Storage: ~1 B docs total × 100 KB average = 100 TB of doc state. Plus version history × 10-50 versions = ~1-5 PB.
WebSocket connections: active editors only: ~5 M concurrent.

The system is stateful with continuous low-latency synchronization. Storage is modest; latency and convergence are the hard parts.

Step 3 — System Interface#

// WebSocket protocol
CONNECT { doc_id, since_revision }
EDIT    { op, revision_seen }              -- client sends an edit
ACK     { revision_assigned }              -- server confirms with new revision
REMOTE  { op, revision }                    -- server pushes others' edits
CURSOR  { user_id, position }
PRESENCE { joined / left }

// HTTP (cold path)
GET   /docs/:id                          (initial load, returns doc + revision)
POST  /docs                              (create)
GET   /docs/:id/history?since=...        (version history)

The since_revision parameter is the foundation of incremental sync — a client reconnecting after a brief outage replays only the operations it missed.

Step 4 — High-Level Design#

       client A                                       client B
          │                                              │
          ▼                                              ▼
       WebSocket gateway ◀────── per-doc affinity ─────► WebSocket gateway
          │                                              │
          └──────────────► doc shard owner ◀─────────────┘
                                  │
                                  ▼
                       in-memory doc state + op log
                                  │
                                  ▼
                          periodic snapshot to durable store
                                  │
                                  ▼
                     storage (Spanner-class) + version history

The per-doc shard owner is the linchpin. One process per doc is authoritative for ordering operations. All edits flow through it; it assigns monotonic revisions; it broadcasts to subscribers.

This shape avoids the multi-leader consistency problem entirely: there’s one leader per doc, even if the broader system is multi-region.

Step 5 — Data Model#

Document content (in memory on the shard owner; periodically snapshotted):

doc:
  doc_id: uuid
  revision: int                  // monotonic
  content: rope                  // editable text structure
  op_log: list<Op>               // last N ops since last snapshot

Operations (the unit of edit):

struct Op:
  type: insert | delete | format
  position: int (or path for tree-structured docs)
  content: string (for insert)
  length: int (for delete)
  author_id: uuid
  client_clock: int             // for OT; vector clock entry for CRDT

Persistent store (per doc):

table docs
  doc_id      uuid       PK
  current_revision int
  snapshot    bytes     // serialized doc state
  snapshot_revision int
table doc_ops
  doc_id       uuid     PK
  revision     int      CK
  op           bytes
  author_id    uuid
  ts           timestamp
table doc_snapshots         // immutable; for version history
  doc_id       uuid
  revision     int
  taken_at     timestamp
  bytes        bytes

Step 6 — Detailed Design#

The core problem: concurrent edits#

Two users editing position 5 simultaneously will produce different intentions. Some algorithm has to reconcile them. Two main approaches:

Operational Transformation (OT) — server orders all ops linearly. When op A (from user 1) arrives at the same time as op B (from user 2), the server transforms B against A so both end up applying coherently. Requires a central server (the order is what the server says). Used by Google Docs.

CRDT (Conflict-free Replicated Data Type) — ops carry enough metadata (unique IDs, causal context) that they commute regardless of order. No central server required for correctness; suitable for peer-to-peer and offline-first. Used by Figma, Yjs-based tools (Notion-likes).

Both work. The trade-off:

OT is operationally simpler when you already have a server. Op size is small (an int position + content). Transforms are tricky for rich-text but the math is well-known. Bad fit for offline-first / P2P because ordering depends on the server.
CRDT is offline-first and decentralization-friendly. Op size includes unique IDs and causal context — bigger. Document state can grow with tombstones for deletions; needs garbage collection. The math is harder to get right but the runtime behavior is simpler.

OT in practice (the Google Docs choice)#

User A's local state at revision 10 → user A types 'x' at position 5 → op_A
User B's local state at revision 10 → user B types 'y' at position 5 → op_B

Both send to server.

Server (already at revision 10) accepts op_A first → revision 11. Broadcasts.
Server receives op_B based on revision 10 → transforms op_B against op_A:
   transformed_B has its position shifted to 6 (because A inserted before)
   → revision 12. Broadcasts transformed_B.

Each client applies the broadcast op, transformed against any local ops still
pending. Both converge to revision 12 = "yx" (or "xy" depending on tie-break).

The transform function is operation-pair specific: insert vs insert, insert vs delete, delete vs delete. For plain text, it’s ~6 cases. For rich text (formatting, embeds, tables), it’s hundreds.

Per-document affinity#

All ops for a doc go to one shard owner. The WebSocket gateway routes based on doc_id hash. If gateway A receives a client for doc X but the owner is on gateway B, the gateway forwards over a backend connection.

This is the same shape as /system-design/whatsapp — stateful router for long-lived connections.

Snapshot and op-log split#

The shard owner keeps:

A base snapshot at some revision (say, every 1000 ops).
The op log of all ops since the snapshot.

A new client joining sees the snapshot + the op log applied. On crash recovery, the shard owner re-loads the snapshot and replays the log.

Snapshots are durable; ops are durable in their own append-only log. The op log is what’s authoritative for ordering — the snapshot is a periodic optimization for catch-up.

Latency budget (target 200 ms keystroke-to-remote)#

Client A keystroke → local apply (optimistic):     0 ms
Send WS op to gateway A:                           20 ms
Forward to shard owner:                            5 ms
Owner orders op (assigns revision):                 1 ms
Broadcast to other gateways:                       20 ms
Send WS REMOTE op to client B:                     20 ms
Client B applies (with own pending transforms):    5 ms
                                          total:  ~70 ms p50
                                                  ~150-200 ms p99 cross-region

Optimistic local application is what makes typing feel instant; remote convergence is the part we budget for.

Offline editing#

A client offline collects local ops, all based on its last-known revision (say, 100). On reconnect:

1. Client sends queued ops + base revision (100).
2. Server has progressed to revision 150 meanwhile.
3. Server transforms each queued op against ops 101-150 in order.
4. Server applies the transformed ops, assigning revisions 151-160.
5. Server returns ACKs.
6. Server broadcasts ops to others (already happened for 101-150, now happens for 151+).

The longer the offline window, the more transforms required — but the cost is bounded by the volume of intervening ops. CRDTs handle this more naturally (no transforms, just merge), which is why CRDT-based tools market themselves as “offline-first”.

Presence and cursor#

Cursor positions are sent as a separate, lower-priority message stream. They’re ephemeral — not stored. Each cursor message carries the user_id and a position. On render, each peer overlays remote cursors on its local view, transforming positions through any local pending ops to keep them visually consistent.

Version history#

Every K snapshots are kept forever as “named versions”. Users can browse history, see who edited what, and revert.

Reverting is itself an op (a big “set content” op that the shard owner orders normally). It doesn’t violate convergence; it’s just a large edit.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: hot documents. A doc with 100 concurrent editors generates 1,500 ops/sec on one shard owner. CPU-bound by the transform function. Mitigations: batched ops (collapse multiple keystrokes into one op every 50 ms), and cap concurrent editors at a soft limit (Google Docs caps live editing at ~100 in one tab).

Bottleneck #2: shard owner failover. The owner is a single point. Replicate state to a hot standby; on owner failure, the standby takes over, but in-flight ops may need re-sending by clients. Clients detect missing ACKs after a timeout and resend with the same client_clock — idempotent.

Bottleneck #3: op-log growth. A long-lived doc accumulates millions of ops. Snapshots every K ops bound the log to recent activity. Pre-snapshot op archives are not needed for live editing — only for “show me edit history”.

Alternative I’d push back on: a “last-write-wins” model with naive merging. Loses concurrent edits; users notice immediately (“my paragraph disappeared!”). OT or CRDT is non-negotiable for real collaborative editing.

What breaks first at 10× scale (10 M concurrent active docs): gateway memory for WebSocket state. Already significant; at 10× the fleet grows linearly. The shard-owner layer scales fine — docs are independent.

Companies this resembles#

Google Docs (OT), Microsoft Office Online (OT), Figma (custom CRDT-ish multiplayer), Notion (Yjs-based CRDT), Linear, Miro. Open-source: ShareJS / OT.js, Yjs, Automerge.

WhatsApp — same stateful-router pattern for long-lived connections.
Distributed Cache — used for active-doc working state.
Consistency Models — the theoretical backdrop for OT vs CRDT.
Sequencer — the shard owner is a per-document sequencer.