Facebook Messenger — System Design

Step 1 — Clarify Requirements#

Functional

1:1 and small-group (up to ~250 members) chats. Messages can be text, images, links, replies, reactions.
Real-time delivery while the recipient is online. Push notification when they are not.
Typing indicators, read receipts, online/last-seen presence (user-configurable).
Threads and reactions on individual messages.
Multi-device: a user signed into web, iOS, Android sees the same chat history on all three with consistent unread state.
Out of scope: voice/video calls, end-to-end encryption (assume transport-encrypted only — Messenger has a separate E2EE mode treated as another writeup), Stories.

Non-functional

99.99% availability — Messenger going dark is a major incident.
p99 send-to-deliver latency under 500 ms when both ends are online.
1 B+ DAU, peak concurrent connections roughly 200 M.
Eventual consistency on presence and read receipts is fine. Strong ordering of messages within a single conversation is not negotiable.

Step 2 — Capacity Estimation#

Connections: 200 M concurrent WebSocket / long-poll sessions at peak. With ~10 K connections per gateway box, that’s ~20 K gateway hosts at peak (multi-region).
Message rate: average user sends ~10 messages/day → 10 B/day → ~120 K msg/sec average, ~500 K/sec peak. Group fan-out roughly doubles raw deliveries (avg group size ~3).
Storage: 10 B msgs/day × ~500 B (text+metadata) = 5 TB/day, ~1.8 PB/year. With attachments (10% of messages, 200 KB avg) tack on ~200 TB/day which actually lives in blob storage, not the chat DB.
Signals (typing, read receipts, presence pings): roughly 5× the message rate. ~600 K/sec average, ~2.5 M/sec peak of small ephemeral events. Critically, none of this should hit durable storage.
Push notifications: roughly half of messages target an offline recipient → ~60 K APNs/FCM pushes/sec average, ~250 K/sec peak.
Bandwidth on the long-poll path: 200 M connections × 100 B/min heartbeat ≈ 30 GB/min keepalive overhead — small per-connection, large in aggregate.

Where the numbers steer the design: it’s connection-density, not message rate, that dominates infrastructure cost. A WhatsApp-style “C10M per box” gateway tier is the only economical way to terminate 200 M sockets.

Step 3 — System Interface#

The wire protocol matters more than a REST surface. Two channels:

Persistent socket (WebSocket, MQTT-style framing):

client → CONNECT(auth_token, device_id, last_seen_seq)
server → CONNACK(session_id, missed: [msg, msg, ...])

client → SEND(client_msg_id, conv_id, body, ts)        // idempotent on client_msg_id
server → ACK(client_msg_id, server_msg_id, server_ts)

server → DELIVER(server_msg_id, conv_id, sender_id, body, ts)
client → DELIVERED(server_msg_id)
client → READ(server_msg_id)                            // explicit receipt

client → TYPING(conv_id, started/stopped)               // ephemeral, not stored
server → TYPING(conv_id, user_id, started/stopped)

REST-ish auxiliary (for the cold/cross-device sync path):

GET   /conversations?cursor=...
GET   /conversations/:id/messages?before=<seq>&limit=50
POST  /conversations  (create new)
GET   /presence?user_ids=[...]

The client_msg_id is critical: a network blip on send means the client retries; the server dedups on (sender_id, client_msg_id).

Step 4 — High-Level Design#

                                  ┌──── push fanout ──→ APNs / FCM
                                  │
[mobile] ──┐                      │
           ├─→ edge LB ─→ gateway tier ─→ message router ─→ message store (Cassandra)
[web]    ──┤   (TLS)      (sticky)            │                        │
[desktop]──┘                                  │                        │
                                              ├─→ presence service (Redis)
                                              ├─→ counter service (unread)
                                              └─→ ephemeral bus (typing, receipts)
                                                       │
                                                       └─→ subscribers (gateway tier delivers)

  ┌──── attachments ─→ blob store (S3) ←─── thumbnailer / transcoder ───┐
  └────────────────────────────────────────────────────────────────────┘

Two layers do the work: a stateful gateway tier that owns the socket-to-user mapping (sticky), and a stateless message router behind it that does the actual fan-out, write, and downstream notification.

Inter-server delivery uses the ephemeral bus: when Alice on gateway-A1 sends to Bob on gateway-B7, the router publishes to a topic keyed by Bob’s user-id; gateway-B7 (subscribed to that topic) pushes it down the socket. Either side losing its socket — we replay from message store using the last-acked server_msg_id.

Step 5 — Data Model#

Conversations:

table conversations
  conv_id      uuid     PK
  type         enum(direct, group)
  created_at   timestamp
  title        string?         // groups only
  last_msg_id  bigint          // hot row; updated on every send
  member_count int

table conv_members
  conv_id      uuid     PK
  user_id      uuid     CK
  joined_at    timestamp
  role         enum(member, admin)
  unread_count int
  last_read_msg_id bigint

Messages (the hot table — partition by conv_id):

table messages
  conv_id      uuid       PK
  msg_seq      bigint     CK     // monotonic per-conv, assigned by router
  sender_id    uuid
  body         text/blob
  attachments  list<url>
  reply_to     bigint?           // threaded reply
  reactions    map<user_id, emoji>     // small map; denormalized
  created_at   timestamp
  edited_at    timestamp?
  deleted_at   timestamp?

Partition by conv_id so that fetching a conversation’s history is a single range scan. The clustering key msg_seq is per-conversation monotonic — assigned by a sequencer co-located with the conversation’s home shard. This is what gives us per-conversation strong ordering.

User inbox (cross-conversation index):

table user_inbox
  user_id      uuid       PK
  updated_at   timestamp  CK desc
  conv_id      uuid
  preview      string
  unread       int

This is what powers the conversation list when the user opens the app. Updated on every send to any conversation they’re in.

Ephemeral state (Redis, never durable):

presence:{user_id}              → { state, last_seen, devices: [...] }
typing:{conv_id}                → set of (user_id, expires_at)
socket:{user_id}                → set of gateway_ids holding active sockets

Step 6 — Detailed Design#

Per-conversation ordering#

A naïve UUID timestamp for msg_seq collides under burst traffic. We assign sequence numbers from a sequencer service that owns ranges per conv_id:

router → sequencer.reserve(conv_id, n=64)    // batch reservation
returns: [base, base+1, ... base+63]

The router uses these locally until exhausted, then refills. A router crash leaks at most 63 sequence numbers per active conversation — gaps are tolerated by clients (they don’t display [gap], they just skip).

Fan-out within a conversation#

on SEND(conv_id, body, client_msg_id):
  msg_seq = sequencer.next(conv_id)
  write messages row (durably, quorum)
  publish on ephemeral_bus topic conv:{conv_id} { msg_seq, sender, body, ... }
  ack(client_msg_id, msg_seq) to sender
  update conv.last_msg_id, conv_members.unread_count (atomic batch)
  enqueue push-notification job for offline members

For a 250-person group, the publish is a single bus message; the bus fans out to whichever gateways are subscribed for any of the conversation’s members. Cost is O(active_gateways), not O(members).

Typing and read receipts#

Typing indicators must never write to durable storage. The flow:

client → TYPING(conv) → gateway → ephemeral_bus topic conv:{id}:typing
            → subscribed gateways → DELIVER TYPING(...) down socket

A typing indicator expires after 4 seconds without a refresh — the consuming client uses local timers, not a server-side TTL. If you store typing pings durably you’ll watch the disk fill in real time; this is the most common rookie mistake in a Messenger-style design.

Read receipts are durable (they advance last_read_msg_id) but written async — losing a few read receipts on gateway crash is cosmetic.

Multi-device sync#

A user signs in on a new device — the device replays missed messages from last_seen_seq per conversation. Order is preserved by msg_seq. Reactions and edits arrive as separate events with their own seq; clients apply them to the local message store by foreign key.

The hard case is unread count consistency across devices: Alice on iOS reads message 50, but her web tab thinks it’s at 47. We solve this by treating the read marker as a CRDT-like maximum:

on READ(conv_id, msg_seq) from any device:
  last_read = max(stored_last_read, msg_seq)
  broadcast to all sessions of this user

The broadcast is via the ephemeral bus, keyed by user_id. Any session subscribed to its own user topic re-syncs the unread count locally.

Presence#

Presence is the most over-engineered part of any messenger if you let it be. Three tiers:

Push presence — every connect / disconnect updates a central Redis key, and every interested viewer subscribes. Real-time but the publisher-subscriber matrix explodes at scale (your friend graph × event rate).

Pull presence — clients ask “is X online?” on demand, with caching. Much cheaper but visible lag. We use pull-with-TTL (5 s) and only push for “currently in this conversation with me” — the small set that matters.

We adopt the pull model for the global friend list and a push model scoped to the conversation surfaces a user has open. Web’s chat sidebar can show a slightly stale dot; the conversation header has to be sharp.

Push notifications#

Offline detection: when a message is published, the gateway checks socket:{user_id} — if empty, enqueue a push notification job. The job batches in a 200 ms window (so 5 messages arriving back-to-back send a single notification “Bob: 5 messages”) and then dispatches to APNs / FCM. APNs and FCM are themselves rate-limited per-app; we use multiple bundle topics and prioritization.

Web-first reach#

A web client behind corporate firewalls often can’t keep a long-lived WebSocket open. The gateway supports a graceful long-poll fallback (the same wire framing, just over HTTP). Battery-sensitive mobile clients use MQTT over a single TCP connection multiplexing all their topics. The session is keyed by (user_id, device_id); multiple devices each hold their own.

Attachments#

Photos and videos go through a separate ingress: client uploads to a signed S3 URL, server receives only the URL. The messages.attachments column stores the URL plus a content-hash for dedup. Thumbnails are generated async; clients see a low-res preview within ~1 second and the full asset within ~3.

Latency budget (p99 500 ms end-to-end)#

client → edge:                  30 ms (regional)
gateway hand-off → router:       5 ms (intra-DC)
sequencer reserve (amortized):   0.5 ms
durable write quorum:           15 ms
ephemeral bus publish:           3 ms
fanout subscriber → gateway:     5 ms
gateway → recipient client:     30 ms
                          total: ~90–120 ms p50, ~400 ms p99

The p99 tail is dominated by cross-region delivery and slow networks. We deliberately under-promise (500 ms) so that 5% of cross-continent deliveries don’t burn the SLO.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: connection density on the gateway tier. 200 M concurrent sockets at peak. Linux kernel tuning (file descriptors, ephemeral port reuse, TCP buffer sizes) becomes a first-class engineering effort. A bad rollout — say, a TLS library upgrade that adds 50 KB of memory per connection — translates directly into 10 PB of additional RAM. Capacity planning happens at the connection level, not the QPS level.

Bottleneck #2: ephemeral bus throughput. 2.5 M signal events/sec peak across 200 M subscribers is a non-trivial pub-sub workload. Sharded topic spaces (e.g. partition conv:* topics across N brokers by hash) keep any single broker under load, but cross-broker subscription rebalances during deploys are the riskiest operational moment.

Bottleneck #3: per-conversation hot rows. A 250-person work group all online during a launch chat: 50 messages/min, plus 250 typing-start/stop pairs, plus 250 read receipts on every message. The conversation’s home shard sees a hundred KB/s of writes for one row. We mitigate with batching at the router (coalesce read-receipt updates per 500 ms window) and refusal to durably store typing.

Bottleneck #4: the user inbox table. user_inbox is updated on every send to any conversation the user is in. A user in 50 active conversations gets thousands of inbox writes/day. We shard by user_id and accept that the row gets hot; the conversation list is read once on app open and cached client-side. Updates flow in only when conversations actually change.

Alternative I’d push back on: a single “messages” event-log shard per user. Tempting because it makes “show my unread” trivial, but it doesn’t survive group chats — every member would write to every other member’s log. The conv_id-partitioned store with a derived user inbox is the right shape.

What breaks first at 10× scale (10 B DAU equivalent): the cross-region delivery fabric. The ephemeral bus today assumes most conversations have both ends in roughly the same region; at extreme scale you get more cross-region threads (multinational families, distributed work groups) and the inter-region replication bandwidth becomes a billing-line item.

Companies this resembles#

Facebook Messenger, Instagram DMs, Slack (with channels instead of conversations and a different presence model), Discord (similar gateway + bus architecture, gaming-tuned), Signal (the schema looks similar but E2EE shifts the data model considerably).

WhatsApp — the close cousin; mobile-first, end-to-end encrypted, different push posture.
Pub-Sub — the substrate the ephemeral bus is built on.
Distributed Cache — the Redis layer for presence and socket maps.