End-to-end-encrypted messaging at scale: presence, delivery receipts, group chats, media.
Step 1: Clarify Requirements
Functional
- Send 1:1 and group text messages between users identified by phone number.
- End-to-end encryption: server cannot read message contents.
- Delivery receipts (sent / delivered / read) and presence (online, last-seen).
- Media messages (images, audio, video) handled separately from text.
- Group chats up to ~1,000 members.
- Out of scope here: voice / video calls, channels, payments, status (story-equivalent).
Non-functional
- 99.99% availability for message delivery.
- p99 send-to-deliver latency under 500 ms for users on the same continent.
- 2 B users, ~100 B messages/day, ~1.2 M messages/sec on average (several times that at peak).
- Messages must persist until delivered; once delivered to all recipients, the server can drop them.
Step 2: Capacity Estimation
- Daily messages: 100 B → ~1.2 M messages/sec average, ~5 M/sec peak (New Year’s Eve, World Cup final).
- Active concurrent connections: ~500 M long-lived WebSockets at any moment.
- Per-message bytes (encrypted text): ~200 B header + ~200 B ciphertext = ~400 B. Network ingest: 1.2 M × 400 B ≈ 480 MB/sec, call it ~500 MB/sec on average.
- Media: ~10 B media messages/day; average 500 KB. That is ~5 PB/day, or ~60 GB/sec of media bytes on average (more at peak).
- Storage (transient): messages are stored only until delivered. With ~5% offline recipients holding messages for ~12 hours, the queued working set is 100 B × 0.05 × 0.5 day × 400 B ≈ 1 TB across the queue tier.
- Storage (media): media blobs persist ~30 days for re-download. 60 GB/sec × 86 400 × 30 × portion-retained ≈ 150 PB × portion-retained, so active media storage tops out around 150 PB.
The system is a stateful router for short messages over long-lived connections — not a database. That reshapes everything.
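The estimates in this step can be reproduced with a few lines of arithmetic. The sketch below uses only the inputs stated above; the decimal interpretation of "KB" and the retained fraction of 1.0 (an upper bound) are assumptions. Unrounded, the average works out closer to 1.16 M messages/sec and ~460 MB/sec of text ingest.

```python
# Reproduces the Step 2 estimates from the stated inputs.
SECONDS_PER_DAY = 86_400

messages_per_day = 100e9      # ~100 B messages/day
bytes_per_message = 400       # ~200 B header + ~200 B ciphertext
media_per_day = 10e9          # ~10 B media messages/day
media_bytes_avg = 500e3       # ~500 KB average (assumption: decimal KB)
retained_fraction = 1.0       # assumption: upper bound, nothing deleted early

avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
text_ingest_bytes_per_sec = avg_msgs_per_sec * bytes_per_message
media_bytes_per_sec = media_per_day * media_bytes_avg / SECONDS_PER_DAY

# Offline working set: 5% of a day's messages held for half a day.
offline_queue_bytes = messages_per_day * 0.05 * 0.5 * bytes_per_message

# Media kept for 30 days at the average media rate.
media_30d_bytes = media_bytes_per_sec * SECONDS_PER_DAY * 30 * retained_fraction

print(f"average messages/sec : {avg_msgs_per_sec / 1e6:.2f} M")
print(f"text ingest          : {text_ingest_bytes_per_sec / 1e6:.0f} MB/s")
print(f"media throughput     : {media_bytes_per_sec / 1e9:.1f} GB/s")
print(f"offline working set  : {offline_queue_bytes / 1e12:.1f} TB")
print(f"30-day media storage : {media_30d_bytes / 1e15:.0f} PB")
```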
Step 3: System Interface
The transport is a persistent WebSocket / XMPP-like protocol, not HTTP request/response. Logical operations:
    CONNECT                                       (auth handshake, presence registration)
    SEND { to, ciphertext, msg_id }               -- 1:1 message
    SEND_GROUP { group_id, ciphertext, msg_id }
    ACK { msg_id, state: sent|delivered|read }
    PRESENCE { online | last_seen_at }
    GET_KEY                                       (prekey bundle for E2EE session setup)

    POST /media       (out-of-band HTTPS for media upload; returns blob_uri)
    GET /media/:id    (download by uri; encrypted blob)

Media goes over plain HTTPS; the blob itself is encrypted with a per-message symmetric key delivered via the messaging channel.
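As an illustration of the logical operations, here is a minimal sketch of two frames and a serializer. The wire format is not specified in this design: JSON, the `op` field, base64 for ciphertext, and the names `Send`, `Ack`, and `encode_frame` are all assumptions made for the example.

```python
import base64
import json
from dataclasses import asdict, dataclass
from typing import Union


@dataclass(frozen=True)
class Send:
    to: str
    msg_id: str
    ciphertext: bytes  # opaque to the server


@dataclass(frozen=True)
class Ack:
    msg_id: str
    state: str  # one of: "sent", "delivered", "read"


OPS = {Send: "SEND", Ack: "ACK"}
ACK_STATES = {"sent", "delivered", "read"}


def encode_frame(frame: Union[Send, Ack]) -> str:
    """Serialize a frame to JSON; bytes fields become base64 text."""
    if isinstance(frame, Ack) and frame.state not in ACK_STATES:
        raise ValueError(f"unknown ACK state: {frame.state}")
    body = {
        key: base64.b64encode(value).decode("ascii") if isinstance(value, bytes) else value
        for key, value in asdict(frame).items()
    }
    return json.dumps({"op": OPS[type(frame)], **body})


print(encode_frame(Send(to="bob_id", msg_id="m-1", ciphertext=b"\x00\x01")))
print(encode_frame(Ack(msg_id="m-1", state="delivered")))
```

A production protocol would more likely use a compact binary framing; JSON is used here only to keep the example readable.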
Step 4: High-Level Design
                      presence service
                              ▲
                              │
phone ── TLS ─→ edge gateway ─┴─→ session manager (which user is on which gateway?)
                     │             │
                     │             ▼
                     └→ message router → recipient's edge gateway → recipient phone
                              │
                              ▼
                        offline queue (KV, per-user, drained on reconnect)

HTTPS uploads → media blob store → CDN download for recipients

The session manager is the central piece: given a phone number, return the gateway IP currently hosting their session. Without this, a “send to Alice” can’t know which of 1,000 gateways to deliver it to.
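The session manager's contract (heartbeat in, gateway lookup out) can be sketched with an in-memory map standing in for the real store. The class name `SessionRegistry`, the 90-second TTL, and the injectable clock are assumptions for illustration; the design only states that heartbeats refresh the record.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionRegistry:
    """In-memory stand-in for the session store: user_id -> gateway."""

    def __init__(self, ttl_s: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s          # assumption: tolerate a few missed heartbeats
        self._clock = clock
        # user_id -> (gateway_ip, expires_at)
        self._sessions: Dict[str, Tuple[str, float]] = {}

    def heartbeat(self, user_id: str, gateway_ip: str) -> None:
        """Create or refresh the session record for a connected user."""
        self._sessions[user_id] = (gateway_ip, self._clock() + self._ttl_s)

    def lookup(self, user_id: str) -> Optional[str]:
        """Return the hosting gateway, or None if the user has no live session."""
        entry = self._sessions.get(user_id)
        if entry is None:
            return None
        gateway_ip, expires_at = entry
        if self._clock() >= expires_at:
            del self._sessions[user_id]  # expired: treat the user as offline
            return None
        return gateway_ip


registry = SessionRegistry()
registry.heartbeat("alice", "10.0.0.7")
print(registry.lookup("alice"))  # gateway currently hosting alice
print(registry.lookup("bob"))    # no session recorded
```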
Step 5: Data Model
Sessions (Redis, sharded by user):
    key: session:{user_id} → { gateway_ip, connected_at, app_version }
    TTL: refreshed by heartbeat every 30 s

Offline queue (per-user; written when recipient is offline, drained on reconnect):

    table offline_messages
      user_id      uuid       PK
      msg_id       timeuuid   CK
      ciphertext   bytes
      sender_id    uuid
      enqueued_at  timestamp

Messages older than ~30 days that were never delivered are dropped (and the sender sees a “not delivered” state forever).
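A minimal sketch of the offline queue's behavior, using an in-memory dict in place of the KV store. The names `OfflineQueue` and `Queued` are hypothetical, and append order stands in for the timeuuid clustering key. One simplification is labeled in the code: this version removes messages on drain, whereas a real system would likely delete only after the client ACKs delivery.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

MAX_AGE_S = 30 * 86_400  # ~30 days, per the retention rule above


@dataclass(frozen=True)
class Queued:
    msg_id: str
    sender_id: str
    ciphertext: bytes
    enqueued_at: float  # seconds since epoch


class OfflineQueue:
    def __init__(self) -> None:
        self._by_user: Dict[str, List[Queued]] = defaultdict(list)

    def enqueue(self, user_id: str, msg: Queued) -> None:
        """Store a message for a recipient with no live session."""
        self._by_user[user_id].append(msg)

    def drain(self, user_id: str, now: float) -> List[Queued]:
        """Return pending messages in arrival order, dropping expired ones.

        Simplification: messages are removed here, not after a delivery ACK.
        """
        pending = self._by_user.pop(user_id, [])
        return [m for m in pending if now - m.enqueued_at <= MAX_AGE_S]


queue = OfflineQueue()
queue.enqueue("bob", Queued("m-1", "alice", b"...", enqueued_at=0.0))
queue.enqueue("bob", Queued("m-2", "alice", b"...", enqueued_at=100.0))
print([m.msg_id for m in queue.drain("bob", now=200.0)])
```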
Groups:
    table groups
      group_id    uuid        PK
      member_ids  set<uuid>
      metadata    { name, admin_ids, created_at }

Prekey bundles (for Signal Protocol session bootstrap):

    table prekeys
      user_id    uuid   PK
      prekey_id  int    CK
      bundle     bytes  // identity_key + signed_prekey + one-time prekey

One-time prekeys are consumed (single-use) when starting a new session; the identity key and signed prekey are reused across sessions.
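The single-use rule can be sketched as a "claim" operation that hands out a bundle and deletes the one-time prekey it included. `PrekeyStore` and `claim_bundle` are hypothetical names, and the keys are placeholder bytes, not real key material. In X3DH a bundle may omit the one-time prekey once the supply is exhausted, which the sketch models by returning `None` in that slot.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class PrekeyStore:
    identity_key: bytes
    signed_prekey: bytes
    one_time_prekeys: Dict[int, bytes] = field(default_factory=dict)

    def claim_bundle(self) -> Tuple[bytes, bytes, Optional[Tuple[int, bytes]]]:
        """Return (identity_key, signed_prekey, one_time) for a new session.

        The one-time prekey, if any remain, is removed so it is never reused.
        """
        one_time = None
        if self.one_time_prekeys:
            prekey_id = next(iter(self.one_time_prekeys))
            one_time = (prekey_id, self.one_time_prekeys.pop(prekey_id))
        return self.identity_key, self.signed_prekey, one_time


store = PrekeyStore(
    identity_key=b"IK-placeholder",
    signed_prekey=b"SPK-placeholder",
    one_time_prekeys={1: b"OTK-1", 2: b"OTK-2"},
)
print(store.claim_bundle()[2])  # first one-time prekey, now consumed
print(store.claim_bundle()[2])  # second one-time prekey
print(store.claim_bundle()[2])  # None: supply exhausted
```

Clients would normally upload fresh one-time prekeys before the supply runs out; that replenishment path is omitted here.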
Step 6: Detailed Design
End-to-end encryption
Uses the Signal Protocol (X3DH for session setup, Double Ratchet for ongoing messages). Each (sender, recipient) pair has a ratcheting session; per-message keys are derived and discarded.
The server never sees plaintext. It routes ciphertext blobs and tracks metadata (who sent to whom, when delivered). Group messages are encrypted once per recipient — a 100-member group send means the sender encrypts 99 times, producing 99 distinct ciphertexts the server fans out.
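To make "per-message keys are derived and discarded" concrete, here is a sketch of only the symmetric-key chain step of the Double Ratchet. The HMAC-SHA256 construction with the single-byte inputs 0x01 and 0x02 follows the recommendation in Signal's published Double Ratchet specification; the all-zero starting chain key is a placeholder, and the Diffie-Hellman ratchet and X3DH setup are omitted entirely.

```python
import hashlib
import hmac
from typing import Tuple


def ratchet_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Advance the chain once; return (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


# Placeholder only: a real chain key comes out of the X3DH / DH ratchet.
chain_key = bytes(32)

for n in range(3):
    chain_key, message_key = ratchet_step(chain_key)
    # Each message gets its own key; the old chain key is overwritten.
    print(n, message_key.hex()[:16])
```

Because each step is a one-way function of the previous chain key, an attacker who learns the current chain key cannot recompute the keys of earlier messages.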
Sending a message (the hot path)
Alice sends to Bob:

    1. Alice's app encrypts plaintext with the Alice→Bob ratchet → ciphertext.
    2. Open WebSocket SEND { to: bob_id, msg_id, ciphertext }.
    3. Edge gateway A receives, looks up session:bob_id → gateway B's IP.
    4. Forward ciphertext to gateway B.
    5. Gateway B pushes down Bob's WebSocket → Bob's app decrypts.
    6. Bob's app sends ACK delivered. ACK routes back via the same hop chain.
    7. Bob reads → app sends ACK read.

    If Bob is offline at step 3:
      Write to offline_messages(bob, msg_id, ciphertext).
      When Bob reconnects, his client drains the queue on connect.

The “send to deliver” latency is dominated by the inter-gateway hop. Within a region, that’s <50 ms. Cross-continent: 200-300 ms.
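The routing decision at step 3 (forward if a session exists, otherwise queue) can be sketched as a single function. The `route` signature, the plain dicts standing in for the session store and offline queue, and the `forward` callback are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Forward = Callable[[str, str, str, bytes], None]  # (gateway_ip, to, msg_id, ciphertext)


def route(
    to: str,
    msg_id: str,
    ciphertext: bytes,
    sessions: Dict[str, str],
    offline: Dict[str, List[Tuple[str, bytes]]],
    forward: Forward,
) -> str:
    """Deliver to the recipient's gateway, or queue if they are offline."""
    gateway_ip = sessions.get(to)
    if gateway_ip is None:
        offline.setdefault(to, []).append((msg_id, ciphertext))
        return "queued"
    forward(gateway_ip, to, msg_id, ciphertext)
    return "forwarded"


sessions = {"bob_id": "10.0.0.9"}
offline: Dict[str, List[Tuple[str, bytes]]] = {}


def log_forward(gateway_ip: str, to: str, msg_id: str, ciphertext: bytes) -> None:
    print(f"forward {msg_id} for {to} via {gateway_ip}")


print(route("bob_id", "m-1", b"...", sessions, offline, log_forward))
print(route("carol_id", "m-2", b"...", sessions, offline, log_forward))
```

A fuller version would also fall back to the offline queue when the forward fails, since a session record can be stale for up to one TTL after a disconnect.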
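Group messages at scale
A 1,000-member group send is 999 sends from the perspective of the network. Two design choices: sender-side fan-out, where the sender produces one pairwise ciphertext per recipient, or server-side fan-out, where the sender uploads a single ciphertext per group send and the server copies it to each member.
The sender-fan-out approach is cleanly E2EE: each pairwise message is encrypted with its own ratchet, so compromising any number of other members does not expose the copy addressed to a given recipient.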
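Sender-side fan-out amounts to a loop over the member list that skips the sender. In the sketch below, `fan_out` and the `encrypt_for` callback are hypothetical; the stand-in `fake_encrypt` only tags the payload with the recipient so the example runs and is NOT encryption. A real client would call into its per-recipient ratchet session there.

```python
from typing import Callable, Iterable, List, Tuple

EncryptFor = Callable[[str, bytes], bytes]  # (recipient_id, plaintext) -> ciphertext


def fan_out(
    sender_id: str,
    member_ids: Iterable[str],
    plaintext: bytes,
    encrypt_for: EncryptFor,
) -> List[Tuple[str, bytes]]:
    """Produce one (recipient_id, ciphertext) envelope per other member."""
    return [
        (member_id, encrypt_for(member_id, plaintext))
        for member_id in member_ids
        if member_id != sender_id
    ]


def fake_encrypt(recipient_id: str, plaintext: bytes) -> bytes:
    # Placeholder so the sketch runs; this is not encryption.
    return recipient_id.encode() + b"|" + plaintext


members = [f"user-{i}" for i in range(1000)]
envelopes = fan_out("user-0", members, b"hello", fake_encrypt)
print(len(envelopes))  # one envelope per member other than the sender
```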
Presence
The hardest soft-state problem in messaging.
- A user’s WebSocket connection sends a heartbeat every 30 s.
- The session record’s TTL is refreshed on each heartbeat.
- “Online” = session record present.
- “Last seen” = connected_at of the most recent expired session (snapshotted before expiry).
A naive presence broadcast (notify every contact when Alice goes online) is N×M traffic. The fix: clients query presence on-demand when opening a chat, not on connect. The server caches per-user presence in a hot Redis with sub-second TTL.
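The on-demand query with a short-lived cache can be sketched as a read-through cache in front of the session lookup. `PresenceCache`, the 0.5-second TTL, and the `is_online` callback are assumptions; the text only specifies "sub-second".

```python
import time
from typing import Callable, Dict, Tuple


class PresenceCache:
    """Read-through cache: repeated queries within ttl_s hit memory only."""

    def __init__(self, is_online: Callable[[str], bool], ttl_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._is_online = is_online
        self._ttl_s = ttl_s
        self._clock = clock
        # user_id -> (online, expires_at)
        self._cache: Dict[str, Tuple[bool, float]] = {}

    def get(self, user_id: str) -> bool:
        now = self._clock()
        hit = self._cache.get(user_id)
        if hit is not None and hit[1] > now:
            return hit[0]
        online = self._is_online(user_id)  # falls through to the session store
        self._cache[user_id] = (online, now + self._ttl_s)
        return online


live_sessions = {"alice"}
cache = PresenceCache(lambda user_id: user_id in live_sessions)
print(cache.get("alice"), cache.get("bob"))
```

The trade-off is staleness: a viewer may see "online" for up to one TTL after the user disconnects, which is usually acceptable for presence.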
Delivery and read receipts
Receipts are themselves messages on the same channel: ACK { msg_id, state, recipient_id }. They flow back through the same router. The sender’s UI shows one check (sent to server), two checks (delivered to recipient device), two blue checks (read).
Settling these in order matters for UX: a “read” that shows up before its “delivered” looks broken, even though nothing is actually wrong (a message that was read was necessarily delivered).
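One way to keep the displayed state sane when receipts arrive out of order is to treat the state as monotonic: it only ever moves forward. This is a sketch of that idea, not necessarily how any particular client does it; `apply_receipt` and the numeric ranking are assumptions.

```python
from typing import Optional

RANK = {"sent": 1, "delivered": 2, "read": 3}


def apply_receipt(current: Optional[str], incoming: str) -> str:
    """Return the state the UI should show; never move backwards."""
    if incoming not in RANK:
        raise ValueError(f"unknown receipt state: {incoming}")
    if current is None or RANK[incoming] > RANK[current]:
        return incoming
    return current


state = None
for receipt in ["sent", "read", "delivered"]:  # "delivered" arrives late
    state = apply_receipt(state, receipt)
print(state)
```

With this rule a late "delivered" is simply ignored once "read" has been applied, so the check marks never regress.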
Media

    1. Alice's app uploads encrypted blob to media service over HTTPS.
    2. Media service returns blob_uri (and stores in CDN-backed origin).
    3. Alice sends a message containing blob_uri + the symmetric key (encrypted via the ratchet).
    4. Bob's app decrypts the message body to retrieve blob_uri + key.
    5. Bob's app downloads the encrypted blob from CDN, decrypts locally.

The CDN fronts ciphertext only; it doesn’t know what it’s serving. Lifecycle is 30 days from upload.
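The message sent in step 3 is essentially a pointer to the blob plus the key to open it. The sketch below models that pointer; the `MediaPointer` type, the example URI, and the SHA-256 digest of the encrypted blob (used by the recipient to detect a corrupted or swapped download) are assumptions not stated in the design. The blob encryption itself is left out because it needs an authenticated cipher outside the Python standard library.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaPointer:
    blob_uri: str
    media_key: bytes   # per-message symmetric key; travels inside the E2EE message
    blob_sha256: bytes # assumption: digest of the encrypted blob


def make_pointer(blob_uri: str, encrypted_blob: bytes, media_key: bytes) -> MediaPointer:
    """Build the payload Alice sends over the ratchet after uploading."""
    return MediaPointer(blob_uri, media_key, hashlib.sha256(encrypted_blob).digest())


def verify_download(pointer: MediaPointer, downloaded: bytes) -> bool:
    """Check the CDN returned the same bytes Alice uploaded, before decrypting."""
    return hmac.compare_digest(hashlib.sha256(downloaded).digest(), pointer.blob_sha256)


media_key = secrets.token_bytes(32)
encrypted_blob = b"<ciphertext produced by the client's cipher>"
pointer = make_pointer("https://media.example/blob/abc123", encrypted_blob, media_key)
print(verify_download(pointer, encrypted_blob))
print(verify_download(pointer, b"tampered"))
```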
Step 7: Evaluation & Trade-offs
Bottleneck #1: persistent connections. 500 M WebSockets is the memory bottleneck. With ~10 KB of kernel + userspace state per connection, that’s 5 TB of RAM across the gateway fleet. WhatsApp historically ran a hand-tuned Erlang VM that holds ~2 M connections per box, so 500 M connections need ~250 boxes in total (spread across regions), growing linearly with users.
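The connection-memory figures follow from three inputs, all taken from the paragraph above; the decimal reading of "KB" is an assumption. The division gives the fleet-wide box count for 500 M connections, before any headroom for failover.

```python
connections = 500e6           # concurrent WebSockets
bytes_per_connection = 10e3   # ~10 KB kernel + userspace state (decimal KB assumed)
connections_per_box = 2e6     # ~2 M connections per gateway box

fleet_ram_bytes = connections * bytes_per_connection
boxes = connections / connections_per_box
ram_per_box_bytes = fleet_ram_bytes / boxes

print(f"fleet RAM for connections : {fleet_ram_bytes / 1e12:.0f} TB")
print(f"gateway boxes             : {boxes:.0f}")
print(f"connection RAM per box    : {ram_per_box_bytes / 1e9:.0f} GB")
```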
Bottleneck #2: the session manager. A hot key for any one gateway’s user list. Sharded by user_id mod N; failovers re-bind sessions to new gateways via the persistent connection’s natural reconnect cycle.
Bottleneck #3: cross-region routing. A user in Mumbai messaging a user in São Paulo passes through at least two gateways with a global hop. The latency budget under 500 ms p99 only just absorbs this; under 300 ms it doesn’t, and we either accept higher p99 or invest in per-pair private backbone.
Alternative I’d push back on: making the server able to read messages so it can do server-side fan-out for groups. Tempting (single ciphertext per group send vs N-1) but breaks the entire E2EE guarantee. The increase in sender-side CPU and bandwidth is a worthwhile cost.
What breaks first at 10× scale (~1 T messages/day, ~50 M messages/sec peak): the offline queue’s per-user partition. A user offline for a week in a high-traffic group accumulates a queue that takes minutes to drain on reconnect, blocking the UI. Mitigation: cap per-user queue size, drop oldest, surface “this conversation has been pruned” on reconnect.
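The cap-and-drop-oldest mitigation maps naturally onto a bounded deque plus a flag that records whether anything was evicted. `BoundedQueue` and the cap of 3 are hypothetical; a real cap would be tuned per deployment.

```python
from collections import deque
from typing import Deque, List, Tuple


class BoundedQueue:
    """Per-user queue that keeps only the newest max_len messages."""

    def __init__(self, max_len: int) -> None:
        self._items: Deque[str] = deque(maxlen=max_len)
        self.pruned = False

    def push(self, msg_id: str) -> None:
        if len(self._items) == self._items.maxlen:
            self.pruned = True  # the deque is full, so appending evicts the oldest
        self._items.append(msg_id)

    def drain(self) -> Tuple[List[str], bool]:
        """Return (messages, pruned) and reset; pruned drives the UI notice."""
        items, pruned = list(self._items), self.pruned
        self._items.clear()
        self.pruned = False
        return items, pruned


queue = BoundedQueue(max_len=3)
for i in range(1, 6):
    queue.push(f"m-{i}")
print(queue.drain())
```

On reconnect the client would check the returned flag and show the "this conversation has been pruned" notice when it is set.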
Companies this resembles
WhatsApp, Signal (open-source reference for the protocol), iMessage (Apple-specific delivery layer, same E2EE shape), Telegram (E2EE is opt-in via secret chats; default chats are encrypted only between client and server).
Related systems
- Pub/Sub — the conceptual layer underneath presence broadcasts.
- Rate Limiter — spam prevention on send endpoints.
- Blob Store — backs the media path.
- Quora — different shape (Q&A), shared fan-out problem for notifications.