Publish / Subscribe

Topic-based fan-out, ordering guarantees, filtering, retention, and the gap between pub-sub and queues.

Building Block Intermediate
7 min read
messaging pub-sub event-driven
Companies this resembles: Kafka · Google Pub/Sub · AWS SNS · Redis · NATS

Use cases#

Publish/subscribe broadcasts each message to every interested subscriber. Where a queue routes each message to exactly one consumer, pub-sub fans the same message out to N. The dominant use cases:

  • Event-driven architectureOrderPlaced is published once; inventory, billing, analytics, and customer-comms each subscribe independently.
  • Notification fan-out — a tweet from a celebrity goes to every follower’s feed-builder service; a Slack message to every connected client.
  • Cache invalidation — a write to the primary store publishes “key K is dirty”; every cache layer subscribes and evicts.
  • Real-time updates — chat, presence, live dashboards, multiplayer game state.

The boundary with Distributed Messaging Queue: queues are 1-of-N consumption; pub-sub is N-of-N. Kafka does both via consumer groups; SNS/SQS pairs SNS (pub-sub) with SQS (queue) to combine semantics.

Functional requirements#

  • Publishers send messages to topics (or channels, subjects).
  • Subscribers register interest in topics; each subscriber receives every message published while it was subscribed.
  • Optional: pattern matching (order.*), content filtering (subscribers receive only messages matching predicates).
  • Optional: durable subscriptions that survive subscriber downtime (offsets, replay).
  • Optional: ordering within a partition / key.

Non-functional requirements#

  • Latency: publish-to-deliver p99 under 100 ms for most uses; under 10 ms for real-time UX (chat, presence).
  • Fan-out: a single publish must reach 1000s-1M+ subscribers. Architecture has to assume read fan-out >> write.
  • Throughput: a busy topic can see millions of messages/sec aggregated across subscribers.
  • Backpressure: a slow subscriber must not slow down the publisher or other subscribers. Per-subscriber buffering or dropping is required.

High-level design#

┌── subscriber A (durable, offset 1234)
publishers ──> topic / partition ──┼── subscriber B (durable, offset 1230)
log │
├── subscriber C (ephemeral, latest only)
└── subscriber D (filtered: type=urgent)
each subscriber tracks its own progress; broker does not retain "delivered" state per message

Two topology styles:

Broker-based (Kafka, Pulsar, Pub/Sub) — central broker stores messages, subscribers pull/push from it.
Mesh / broker-less (NATS core, Redis Pub/Sub) — broker just forwards; nothing is retained.

Broker-based dominates the durable-subscription space; broker-less dominates ephemeral real-time fan-out.

Detailed design#

Push vs pull delivery#

Push — broker sends to subscriber endpoint (HTTP webhook, WebSocket, gRPC stream).
Used by SNS, Google Pub/Sub (push subscriptions), most chat systems.
Pro: low latency, broker controls fan-out.
Con: slow subscriber → broker has to drop or buffer, retries are tricky.
Pull — subscriber asks broker for next batch.
Used by Kafka, Pulsar, SQS, Pub/Sub (pull subscriptions).
Pro: backpressure for free, subscriber controls flow.
Con: subscriber needs to be running and polling; cold start hurts latency.

Most production systems offer both. Use push when you need real-time and own the subscriber endpoint; use pull when subscriber load is uneven and you want it self-paced.

Durable vs ephemeral subscriptions#

Ephemeral — subscriber connects, receives messages from "now"; on disconnect, history is lost.
Redis Pub/Sub, NATS core, WebSocket-based UIs.
Durable — subscriber has a persistent identifier; on reconnect, broker replays from the last
committed offset. Kafka consumer groups, Pulsar subscriptions, Pub/Sub subscriptions.

The durable mode is what makes pub-sub useful for service architectures — a service can be redeployed, crash, or pause, and pick up exactly where it left off.

Fan-out at scale#

A topic with 1M subscribers can’t be served by linear iteration. Architectures:

  • Tree fan-out — broker delivers to N edge brokers, each delivers to its share of subscribers. Sub-linear cost per publish.
  • Pull-based aggregation — subscribers pull on their own clock; broker amortizes by serving recent messages from page cache. Kafka.
  • Per-subscriber log — for low-fan-out high-durability use (Pulsar): each subscriber gets its own ledger.

Twitter’s famous “celebrity fan-out” problem (a tweet from a 100M-follower account) is solved with a hybrid: fan-out-on-write for normal accounts, fan-out-on-read for celebrities.

Filtering#

When subscribers want a subset of messages, filtering can happen at three places:

Subscriber-side — broker delivers everything, subscriber drops. Simple, wasteful at scale.
Broker-side — broker evaluates a predicate per message per subscriber. SNS filter policies,
Google Pub/Sub filters. Saves bandwidth, costs broker CPU.
Topic partitioning — pre-route by topic name. `orders.us`, `orders.eu`. Cheapest, requires
publishers to know the routing scheme.

The third is usually the right default — encode routing into the topic structure.

Ordering guarantees#

Strict global ordering destroys horizontal scaling. Practical alternatives:

  • No ordering — fastest. Acceptable for many event-driven uses where events are commutative or carry timestamps.
  • Per-key orderinghash(key) % partitions; same key always lands on the same partition; partitions are strictly ordered. Default for Kafka.
  • Per-publisher ordering — Pulsar’s “key-shared” subscription; messages with the same key go to the same consumer in order.

Backpressure and slow subscribers#

A subscriber 10× slower than the publish rate cannot be allowed to back up the topic forever. Strategies:

  • Bounded buffer + drop — broker keeps N messages per subscriber; oldest dropped on overflow. Default for ephemeral pub-sub (NATS, Redis).
  • Lag monitoring + alert — track consumer lag (committed offset vs latest offset); alert when lag > threshold; let operators scale the subscriber.
  • DLQ for poison messages — repeated decode failures land in a DLQ rather than wedging the partition.
  • Quota per subscription — managed services (Pub/Sub) cap message age; messages older than message_retention_duration are dropped.

Retention#

Kafka retains all messages on a topic for retention.ms (default 7 days), independent of consumer state. Any consumer group can rewind to any offset. This is uniquely powerful: you can stand up a new analytics service tomorrow that processes the last week of events.

Pure pub-sub systems (Redis Pub/Sub) retain nothing — disconnect during a publish means the message is gone.

Trade-offs#

Kafka-style log-based pub-sub — retention, multiple consumer groups, replay, strict per-partition order. Operationally heavier; broker is stateful.
Broker-less / fire-and-forget (Redis, NATS core) — sub-millisecond latency, trivial ops, ephemeral. Disconnect = data loss; not suitable for billing or anything that must not miss an event.

Other axes:

  • Topic-per-tenant vs shared topic with filtering — topic-per-tenant scales cleanly but explodes the topic count (Kafka starts to struggle past ~10k topics). Shared topic with filter is denser but costs broker CPU per message.
  • Push to webhooks vs maintain WebSocket — webhooks are stateless and self-healing; WebSockets are real-time but require server-side connection management.
  • Sync vs async fan-out — publish-and-block-until-all-subscribers-ack vs fire-and-forget. The former is rare in practice (too slow); the latter requires explicit ack-tracking infrastructure if needed.

Real-world examples#

  • Kafka — consumer groups give both queue and pub-sub semantics. Most modern event-driven architectures default here.
  • Google Pub/Sub — fully managed, supports both push (HTTP/2 webhooks) and pull subscriptions, exactly-once support (per-subscription), filters.
  • AWS SNS + SQS — SNS fans out to multiple SQS queues; each downstream service consumes from its own queue. The pattern is so common AWS made it a first-class CDK construct.
  • Redis Pub/Sub — sub-millisecond, ephemeral, the canonical in-process pub-sub for chat and presence. Discord uses it for presence updates.
  • NATS / NATS JetStream — lightweight (single binary), at-most-once core + JetStream for durable streams. Built-in service discovery.
  • Apache Pulsar — Kafka-class throughput plus first-class multi-tenancy and geo-replication. Used by Yahoo and Tencent.
  • MQTT brokers (Mosquitto, HiveMQ) — pub-sub for IoT; QoS levels mirror at-most-once / at-least-once / exactly-once.
Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.