Publish / Subscribe
Topic-based fan-out, ordering guarantees, filtering, retention, and the gap between pub-sub and queues.
Use cases#
Publish/subscribe broadcasts each message to every interested subscriber. Where a queue routes each message to exactly one consumer, pub-sub fans the same message out to N. The dominant use cases:
- Event-driven architecture —
OrderPlacedis published once; inventory, billing, analytics, and customer-comms each subscribe independently. - Notification fan-out — a tweet from a celebrity goes to every follower’s feed-builder service; a Slack message to every connected client.
- Cache invalidation — a write to the primary store publishes “key K is dirty”; every cache layer subscribes and evicts.
- Real-time updates — chat, presence, live dashboards, multiplayer game state.
The boundary with Distributed Messaging Queue: queues are 1-of-N consumption; pub-sub is N-of-N. Kafka does both via consumer groups; SNS/SQS pairs SNS (pub-sub) with SQS (queue) to combine semantics.
Functional requirements#
- Publishers send messages to topics (or channels, subjects).
- Subscribers register interest in topics; each subscriber receives every message published while it was subscribed.
- Optional: pattern matching (
order.*), content filtering (subscribers receive only messages matching predicates). - Optional: durable subscriptions that survive subscriber downtime (offsets, replay).
- Optional: ordering within a partition / key.
Non-functional requirements#
- Latency: publish-to-deliver p99 under 100 ms for most uses; under 10 ms for real-time UX (chat, presence).
- Fan-out: a single publish must reach 1000s-1M+ subscribers. Architecture has to assume read fan-out >> write.
- Throughput: a busy topic can see millions of messages/sec aggregated across subscribers.
- Backpressure: a slow subscriber must not slow down the publisher or other subscribers. Per-subscriber buffering or dropping is required.
High-level design#
┌── subscriber A (durable, offset 1234) │publishers ──> topic / partition ──┼── subscriber B (durable, offset 1230) log │ ├── subscriber C (ephemeral, latest only) │ └── subscriber D (filtered: type=urgent)
each subscriber tracks its own progress; broker does not retain "delivered" state per messageTwo topology styles:
Broker-based (Kafka, Pulsar, Pub/Sub) — central broker stores messages, subscribers pull/push from it.Mesh / broker-less (NATS core, Redis Pub/Sub) — broker just forwards; nothing is retained.Broker-based dominates the durable-subscription space; broker-less dominates ephemeral real-time fan-out.
Detailed design#
Push vs pull delivery#
Push — broker sends to subscriber endpoint (HTTP webhook, WebSocket, gRPC stream). Used by SNS, Google Pub/Sub (push subscriptions), most chat systems. Pro: low latency, broker controls fan-out. Con: slow subscriber → broker has to drop or buffer, retries are tricky.
Pull — subscriber asks broker for next batch. Used by Kafka, Pulsar, SQS, Pub/Sub (pull subscriptions). Pro: backpressure for free, subscriber controls flow. Con: subscriber needs to be running and polling; cold start hurts latency.Most production systems offer both. Use push when you need real-time and own the subscriber endpoint; use pull when subscriber load is uneven and you want it self-paced.
Durable vs ephemeral subscriptions#
Ephemeral — subscriber connects, receives messages from "now"; on disconnect, history is lost. Redis Pub/Sub, NATS core, WebSocket-based UIs.
Durable — subscriber has a persistent identifier; on reconnect, broker replays from the last committed offset. Kafka consumer groups, Pulsar subscriptions, Pub/Sub subscriptions.The durable mode is what makes pub-sub useful for service architectures — a service can be redeployed, crash, or pause, and pick up exactly where it left off.
Fan-out at scale#
A topic with 1M subscribers can’t be served by linear iteration. Architectures:
- Tree fan-out — broker delivers to N edge brokers, each delivers to its share of subscribers. Sub-linear cost per publish.
- Pull-based aggregation — subscribers pull on their own clock; broker amortizes by serving recent messages from page cache. Kafka.
- Per-subscriber log — for low-fan-out high-durability use (Pulsar): each subscriber gets its own ledger.
Twitter’s famous “celebrity fan-out” problem (a tweet from a 100M-follower account) is solved with a hybrid: fan-out-on-write for normal accounts, fan-out-on-read for celebrities.
Filtering#
When subscribers want a subset of messages, filtering can happen at three places:
Subscriber-side — broker delivers everything, subscriber drops. Simple, wasteful at scale.Broker-side — broker evaluates a predicate per message per subscriber. SNS filter policies, Google Pub/Sub filters. Saves bandwidth, costs broker CPU.Topic partitioning — pre-route by topic name. `orders.us`, `orders.eu`. Cheapest, requires publishers to know the routing scheme.The third is usually the right default — encode routing into the topic structure.
Ordering guarantees#
Strict global ordering destroys horizontal scaling. Practical alternatives:
- No ordering — fastest. Acceptable for many event-driven uses where events are commutative or carry timestamps.
- Per-key ordering —
hash(key) % partitions; same key always lands on the same partition; partitions are strictly ordered. Default for Kafka. - Per-publisher ordering — Pulsar’s “key-shared” subscription; messages with the same key go to the same consumer in order.
Backpressure and slow subscribers#
A subscriber 10× slower than the publish rate cannot be allowed to back up the topic forever. Strategies:
- Bounded buffer + drop — broker keeps N messages per subscriber; oldest dropped on overflow. Default for ephemeral pub-sub (NATS, Redis).
- Lag monitoring + alert — track consumer lag (committed offset vs latest offset); alert when lag > threshold; let operators scale the subscriber.
- DLQ for poison messages — repeated decode failures land in a DLQ rather than wedging the partition.
- Quota per subscription — managed services (Pub/Sub) cap message age; messages older than
message_retention_durationare dropped.
Retention#
Kafka retains all messages on a topic for retention.ms (default 7 days), independent of consumer state. Any consumer group can rewind to any offset. This is uniquely powerful: you can stand up a new analytics service tomorrow that processes the last week of events.
Pure pub-sub systems (Redis Pub/Sub) retain nothing — disconnect during a publish means the message is gone.
Trade-offs#
Other axes:
- Topic-per-tenant vs shared topic with filtering — topic-per-tenant scales cleanly but explodes the topic count (Kafka starts to struggle past ~10k topics). Shared topic with filter is denser but costs broker CPU per message.
- Push to webhooks vs maintain WebSocket — webhooks are stateless and self-healing; WebSockets are real-time but require server-side connection management.
- Sync vs async fan-out — publish-and-block-until-all-subscribers-ack vs fire-and-forget. The former is rare in practice (too slow); the latter requires explicit ack-tracking infrastructure if needed.
Real-world examples#
- Kafka — consumer groups give both queue and pub-sub semantics. Most modern event-driven architectures default here.
- Google Pub/Sub — fully managed, supports both push (HTTP/2 webhooks) and pull subscriptions, exactly-once support (per-subscription), filters.
- AWS SNS + SQS — SNS fans out to multiple SQS queues; each downstream service consumes from its own queue. The pattern is so common AWS made it a first-class CDK construct.
- Redis Pub/Sub — sub-millisecond, ephemeral, the canonical in-process pub-sub for chat and presence. Discord uses it for presence updates.
- NATS / NATS JetStream — lightweight (single binary), at-most-once core + JetStream for durable streams. Built-in service discovery.
- Apache Pulsar — Kafka-class throughput plus first-class multi-tenancy and geo-replication. Used by Yahoo and Tencent.
- MQTT brokers (Mosquitto, HiveMQ) — pub-sub for IoT; QoS levels mirror at-most-once / at-least-once / exactly-once.
Related building blocks#
- Distributed Messaging Queue — same brokers often, different consumption semantics.
- Distributed Cache — uses pub-sub for cross-region invalidation.
- Distributed Logging — log streams are pub-sub topics with very durable retention.
- Server-Side Error Monitoring — error events publish to a pub-sub topic for downstream consumers.