Dropbox (File Sync) — System Design

Step 1 — Clarify Requirements#

Functional

A user installs a desktop client; a folder on the local filesystem mirrors a folder in the cloud. Edits in either direction propagate to all devices linked to the account.
Multi-device: laptop, phone, work desktop, web. All converge to the same view within seconds when online.
Selective sync: I can mark a 500 GB photo archive as “online only” on my laptop while keeping it cached locally on my desktop.
File versioning: I can recover any prior version for 30 days (extended for paid tiers).
Sharing: invite another user to a folder; they get read-only or read-write.
Out of scope: rich-document collaboration (operational transform, presence cursors — separate design like Google Docs), the mobile photo-uploader background-sync corner-cases, native filesystem driver implementation.

Non-functional

99.99% availability for the sync API.
11-nine durability for stored data (S3-class). Never, ever lose a customer file.
Latency to “saw your edit” on a peer device: under 5 seconds on a healthy network.
500 M users across all tiers, ~100 EB stored data (heavy long-tail of cold archives).
Bandwidth efficiency is a top-three concern. A 1 GB file with a 1-byte edit must not retransmit the gigabyte.

Step 2 — Capacity Estimation#

Users: 500 M registered, ~100 M MAU, ~20 M DAU active syncs at any given peak hour.
Average per-user storage: 200 GB across paid + free tiers (skewed by enterprise) → ~100 EB total.
Object count: ~200 B logical files, but after chunking into 4 MB blocks → ~25 trillion blocks in the block store. The block store is the dominant index problem.
Write QPS: average user edits ~50 files/day → 10 B edits/day → ~120 K edits/sec average, ~500 K/sec peak. Each edit is potentially 1+ chunks uploaded.
Upload bandwidth: edits average ~500 KB after delta-encoding (most are partial; whole-file rewrites are rare). 120 K × 500 KB = ~60 GB/sec average ingress, ~250 GB/sec peak.
Metadata writes: every sync event is a metadata row update. ~300 K metadata writes/sec average (an edit becomes 2–3 metadata rows: file version, block manifest, namespace event).
Notification fanout: each edit notifies ~2 average linked devices → ~250 K notifications/sec average.
Dedup savings: across 25T blocks, content-hash dedup yields roughly 30% reduction (corporate distros, OS files, identical photos forwarded around). 100 EB logical → ~70 EB physical.

The numbers steer the architecture: chunked block-store ingest is the engineering core; metadata correctness is the business core; bandwidth-efficient deltas are the cost core.

Step 3 — System Interface#

The sync API is split into block plane and metadata plane:

# Block plane (the heavyweight path)
POST  /blocks/upload
        Body: 4 MB chunk bytes + content_hash
        Returns: { stored: true, block_id }
        Idempotent on content_hash (already-stored → instant return)

GET   /blocks/:block_id
        Returns: bytes

# Metadata plane (the chatty path)
POST  /commit
        Body: { path, parent_rev, block_list: [hash, hash, ...], mtime, size }
        Returns: { rev, conflict?: {...} }

GET   /delta?cursor=<opaque>
        Returns: { entries: [...changes since cursor...], next_cursor, has_more }
        Used by clients for long-poll change streams.

POST  /share
        Body: { path, recipient, permission }

The commit endpoint is the heart of the design: the client says “I have these blocks for this path at this parent revision.” If parent_rev matches the server’s current rev, the commit succeeds and the file points at the new block list. Otherwise we have a conflict.

Step 4 — High-Level Design#

                                       ┌──── notify ──→ pub/sub ──→ peer device long-polls /delta
                                       │
[client]─chunk─→ block gateway ─→ block store (S3-class, content-addressed)
   │                                   │
   │                                   ├──→ dedup index (KV: hash → block_id)
   │                                   │
   └─commit──→ metadata service ──→ metadata DB (sharded by namespace_id)
                       │
                       ├──→ change-log table (per-namespace sequence)
                       │
                       └──→ namespace cache (recent paths, hot dirs)

[search / share UI]──→ web app ──→ same metadata service

                          ┌── LAN discovery (mDNS) ──┐
[client] ←── peer LAN sync ───────────────────────── [client]   (same account, same subnet)

The block plane and metadata plane scale independently. The block plane is dominated by bytes (CDN-like ingress and egress with content-addressed dedup); the metadata plane is dominated by writes per second and per-namespace ordering guarantees.

The LAN-sync rail is a cute optimization: clients on the same network discover each other via mDNS and exchange blocks directly. A 5 GB folder copied from one office laptop to another over the LAN saves all 5 GB of ingress and egress on the central system.

Step 5 — Data Model#

Namespaces (a shared folder or a user’s root):

table namespaces
  namespace_id   uuid    PK
  root_path      string
  owner_id       uuid
  members        list<{user_id, permission}>
  current_seq    bigint           // monotonic per namespace

Sharing a folder = creating a namespace with multiple members. A user’s “root” is also a namespace, with exactly one member.

Paths (sharded by namespace_id):

table paths
  namespace_id   uuid    PK
  path           string  CK
  rev            bigint           // monotonic version of THIS path
  size           int
  mtime          timestamp
  is_dir         bool
  block_list     list<hash>       // ordered chunk hashes (empty for dirs)
  deleted        bool
  deleted_at     timestamp?

A file’s content is referred to by hash; the table itself doesn’t store bytes. Two users with identical files share storage automatically via the dedup index.

Change log (the event source for sync clients):

table change_log
  namespace_id   uuid    PK
  seq            bigint  CK     // monotonic, dense per namespace
  path           string
  rev            bigint
  op             enum(add, modify, delete, rename, share)
  actor_id       uuid
  ts             timestamp

Clients pull change_log filtered to their accessible namespaces, since their last-seen seq. This is what /delta returns.

Block store (content-addressed; physical layer is S3-like):

KV index:  content_hash (SHA-256) → block_locator (bucket, key)
Storage:   immutable 4 MB blobs, written once, read many.

Critically, blocks have no path, no namespace, no per-user metadata. They are pure content. This is what makes cross-account dedup safe and cheap.

Step 6 — Detailed Design#

Chunking#

Files are split into 4 MB blocks. Smaller and the per-block overhead (hash, metadata row) dominates; larger and small edits force re-uploading too much. A 1 GB file is 256 chunks; a 100 KB file is one chunk with the rest empty.

We use content-defined chunking (CDC) with a rolling hash (Rabin fingerprint) for files larger than a threshold (say 64 MB). Fixed boundaries would make any insert in the middle of a file re-chunk everything from that point on; CDC anchors chunk boundaries to content patterns so an insert only changes 1–2 chunks regardless of file size.

Fixed-size chunks — split every 4 MB. Simple, fast, predictable. Bad for files with mid-file inserts (a single byte inserted at position 0 rewrites every subsequent chunk).

Content-defined chunks — boundaries chosen by a rolling hash of the content. An insert affects only the chunk containing it; the rest of the file’s chunks stay identical. Slightly more CPU on the client; much less network.

For files smaller than the threshold, fixed chunking is fine and simpler. The break-even is usually around 64 MB — most photos and documents stay fixed.

Delta sync#

Before uploading anything the client computes hashes for all chunks of the local file and sends just the hash list. The server responds with the subset it doesn’t already have. Then the client uploads only those blocks.

client → /commit/preflight { path, hash_list }
server → { need: [h3, h7] }       # already-have: h1, h2, h4, h5, h6
client → /blocks/upload h3
client → /blocks/upload h7
client → /commit { path, hash_list, parent_rev }
server → { rev: 1024 }

For a small edit in a large file, this transfers only a few MB regardless of file size. For a duplicate of a popular file (an OS installer, a viral image) it transfers zero — the dedup index already has every chunk.

Conflict resolution#

The basic shape is optimistic concurrency. A commit includes parent_rev; the server compares it to the current head:

parent_rev == current: fast forward. Write the new rev, advance the path’s head, append to change_log.
parent_rev < current: conflict. The client missed a remote edit.

On conflict we don’t silently merge file contents — that’s how data loss happens at consumer scale. Instead, we materialize both versions:

Original: /work/proposal.docx (rev 5, edited by Alice at 10:01)
Conflict: /work/proposal (Bob's conflicted copy 2024-05-16).docx  (rev 5.1)

The conflicted copy is its own path. Both versions are durable; the user chooses which to keep. This is the rule that lets us never lose customer data: when in doubt, fork.

For directory-level operations (rename, delete) we use server-side serialization keyed by the parent directory. Two simultaneous renames of the same file are linearized; the loser sees a normal conflict.

Change-log replication#

Each client maintains a per-namespace cursor. When connected, it long-polls /delta?cursor=...:

client → /delta?cursor=NS=abc:seq=512  (held open ~30 s)
   ...edit happens elsewhere...
server → { entries: [{seq:513, path:'/foo', op:'modify', ...}], next_cursor: '...:seq=513' }
client downloads blocks for /foo, applies locally
client → /delta?cursor=NS=abc:seq=513   (poll again)

The cursor format encodes per-namespace sequence numbers in an opaque blob — clients treat it as a magic string. Disconnections are seamless: reconnect and continue.

The notification fanout uses a pub/sub layer so that the long-poll endpoint doesn’t have to busy-wait: when a namespace advances, a publish to topic ns:{id} wakes any open long-poll for that namespace’s subscribers.

LAN sync#

Clients on the same account discover each other on the local network via mDNS. When the metadata service tells client A “you need blocks h3 and h7”, A first asks any peer on the LAN — over an authenticated channel — whether it has them locally. If yes, A pulls from the peer directly; if no, A falls back to the central block store.

This is mostly a cost optimization: a household with two devices syncing a vacation photo dump uploads the bytes once to the central system and then transfers locally to the second device, saving 50% of the wire bytes. Authentication uses a short-lived token issued by the metadata service, so a hostile LAN peer can’t poke at your files.

Selective sync and on-demand files#

Some files live in the local mirror as placeholders (zero-byte stubs with the same name and metadata). Opening the placeholder triggers a filesystem driver hook that hydrates the file from the block store. This integrates with macOS File Provider and Windows Cloud Files APIs on those platforms.

The implication for the design: the metadata plane is the source of truth for what files exist; the block plane is consulted lazily for what they contain. The split is real and matters.

Block-store scalability#

25 trillion blocks is the index size that matters. We don’t keep a single global hash → block_id map — that’s too big. Instead, the hash space is sharded:

shard_id = hash_prefix[0:2]   (256 shards)
each shard: KV store of hash → block_locator

Each shard holds ~100 billion entries with bounded growth (4 MB blocks × dedup means each unique hash represents 4 MB physical). A typical shard’s index size is ~10 TB (entries are small) — fits comfortably on SSDs.

Cold blocks (untouched for 90 days) migrate to a tiered cold storage class. Most user files are never re-read after the first month; this is the single biggest cost lever in the system.

Latency budget for a “they edited it 5 seconds ago” experience#

local save → client detects via fs watch:   100 ms
chunking + hashing local file:               50–200 ms (size dependent)
preflight + upload of changed blocks:        200–1000 ms
commit + change-log write:                  20 ms
notification publish to subscribers:         5 ms
remote client long-poll wake:                10 ms
remote download of needed blocks:            200–1000 ms
remote apply to local filesystem:            50 ms
                                       end-to-end: ~1–3 s typical

5 s is the SLO; under good conditions we beat it routinely.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: per-namespace serialization. All metadata writes for a namespace must serialize through one shard to maintain dense monotonic sequence numbers and consistent conflict detection. A 200-person enterprise shared folder during a rapid edit storm (say a sales team prepping for a quarterly close, all touching the same drive) sees the shard’s CPU spike. Mitigations: per-path locking instead of per-namespace, and batching commits at the client when many local saves happen in a 1-second window.

Bottleneck #2: dedup index hot keys. A globally popular file (e.g. a viral image saved by millions of users) has a content hash that’s looked up millions of times. We layer a CDN-style cache in front of the dedup index. The popular hash is served from edge memory; the long tail goes to the sharded KV.

Bottleneck #3: long-poll connection density. 20 M active syncing devices means 20 M open long-poll connections. Same gateway-density problem as messaging systems; we use HTTP/2 with multiplexed streams and a stateful gateway tier that maps (user, namespace) subscriptions to broker subscriptions.

Bottleneck #4: cold storage retrieval. A user who hasn’t touched a folder in 6 months suddenly wants the whole archive. Cold-tier reads have multi-second TTFB and per-GB retrieval fees. We mask this with a “preparing” UI and an explicit tiered-restore job for >10 GB; users with frequent cold-access patterns get auto-promoted back to warm.

Alternative I’d push back on: server-side automatic 3-way merging of conflicting file contents. Customers ask for it constantly. It works for text files; it produces silent corruption for everything else (binary documents, photos, spreadsheets, databases). The conflicted-copy rule looks dumb but is the only safe default. Programmatic merges belong in collaborative-document systems (the Google Docs CRDT model), not in a general-purpose file sync.

What breaks first at 10× scale (1 trillion users, 1000 EB): the change-log retention. Today we keep dense per-namespace sequences forever for client cursors; at extreme scale we’d compact older history into snapshots and require offline clients (>30 days disconnected) to do a full re-baseline rather than replay. The migration to “snapshot + delta” is the bigger architectural change that’s already on the runway.

Companies this resembles#

Dropbox (the canonical), Google Drive (similar architecture with a different conflict story — they lean on real-time CRDTs for Docs but still use this shape for arbitrary files), OneDrive (with deeper Windows shell integration), iCloud Drive (Apple’s variant, heavier on opportunistic device-to-device sync), Box (enterprise-tuned, same fundamentals).

Blob Store — the substrate the block store sits on.
Consistency Models — the optimistic-concurrency / conflicted-copy story is a direct application.
Key-Value Store — the dedup hash index is a sharded KV at planetary scale.