Distributed Messaging Queue — System Design

Use cases#

A messaging queue lets a producer hand off a unit of work to one or more consumers without coordinating directly. It decouples in time, in throughput, and in failure mode:

Asynchronous job processing — email send, image resize, billing reconciliation. The web request returns 202; the worker fleet drains the queue at its own pace.
Buffering bursts — Black Friday triples your write rate; the queue absorbs the spike while downstream services autoscale up.
Service decoupling — order service publishes “order placed”; inventory, billing, shipping, and analytics each consume independently. Each can be deployed, scaled, or fail without affecting the others.
Workflow / pipeline stages — fan a payload through extract → transform → enrich → load, with each stage backed by its own queue and consumer group.

This building block focuses on queues (single-consumer-per-message semantics). For broadcast / multi-subscriber patterns, see Pub-Sub.

Functional requirements#

Producers enqueue messages with optional metadata (key, priority, delay).
Consumers receive messages, process them, and ack on success or nack on failure.
Failed messages retry; after N retries, move to a dead-letter queue (DLQ).
Configurable retention (consume-then-delete vs replay-from-offset).
Optional ordering guarantee within a partition / queue / key.

Non-functional requirements#

Latency: enqueue-to-dequeue p99 under 100 ms; sub-10 ms for low-latency systems.
Throughput: Kafka pushes past 1M messages/sec per broker; SQS scales horizontally without operator intervention.
Durability: writes survive broker failure — usually acks=all plus a replication factor of 3.
Availability: 99.95% minimum for the queue itself; consumers may go down but the queue must stay up.
Ordering: per-partition or per-key, almost never globally — global ordering kills horizontal scalability.

High-level design#

                              ┌── partition 0 ──> consumer A1
producers ──> broker(s) ──────┼── partition 1 ──> consumer A2  (group A)
                              ├── partition 2 ──> consumer A3
                              └── partition 3 ──> consumer A4

                              same partitions ─> consumer B1 / B2  (group B, independent offsets)

      ┌── replicated WAL ────┐
broker│  msg → page cache    │  partition leader handles writes,
      │  → fsync (configurable)  followers replicate
      └──────────────────────┘

      DLQ ←── after N failed retries

The broker stores messages as an append-only log. Producers append; consumers track an offset and read forward. Partitions are the unit of horizontal scaling and the unit of ordering — within one partition, messages are strictly ordered; across partitions, no guarantee.

Detailed design#

Delivery semantics#

Three options. Pick exactly one per use case:

At-most-once    — ack on send, don't retry. Messages may be lost. Used for telemetry, metrics.
At-least-once   — ack on consumer success, retry on failure. Duplicates possible. Default choice.
Exactly-once    — at-least-once + consumer-side idempotency, OR transactional brokers. Hard.

True end-to-end exactly-once doesn’t exist over a network — the FLP impossibility result and the two-generals problem rule it out. What systems like Kafka 0.11+ provide is exactly-once-within-a-Kafka-transaction: read from a Kafka topic, write to another Kafka topic, commit both atomically. As soon as the consumer side leaves Kafka (e.g. writes to Postgres), the application must implement idempotency itself.

Partitions and ordering#

A message’s partition is chosen by hash(key) mod num_partitions. All messages with the same key land on the same partition, in order. This is how you get “messages for user 42 are processed in order” while still scaling: just key by user_id.

Trade-off: changing partition count rehashes keys (with hash(key) mod N — Kafka doesn’t auto-rebalance). Pre-provision generous partition counts.

Consumer groups#

A consumer group is a set of consumer instances cooperating to drain one topic. The broker assigns partitions to members; if you have 10 partitions and 5 consumers, each gets 2 partitions. Add a 6th consumer → rebalance, each gets fewer. Add an 11th consumer → it’s idle (more consumers than partitions = wasted capacity).

Multiple consumer groups on the same topic read independently — each has its own committed offset. This is how the same event stream feeds analytics, billing, and notifications without re-reading from disk.

Retries and dead-letter queues#

attempt = 0
while attempt < max_retries:
    try:
        process(msg)
        ack(msg)
        break
    except RetryableError:
        attempt += 1
        sleep(backoff(attempt))      // exponential, jittered
        continue
    except NonRetryableError:
        send_to_dlq(msg)
        ack(msg)                     // ack so the queue moves on
        break
else:
    send_to_dlq(msg)
    ack(msg)

Backoff with jitter is non-negotiable. Without jitter, a transient failure spike causes synchronized retries that recreate the failure. With jitter, the retries spread over time and the downstream has a chance to recover.

DLQ inspection: failed messages get routed to a separate topic / queue. An on-call human reviews, fixes, and either replays them or drops them. Some teams build “redrive” tooling to bulk-replay DLQ contents after fixing the bug.

Visibility timeout (SQS-style)#

When a consumer receives a message, the broker hides it for visibility_timeout seconds. If the consumer crashes without acking, the message becomes visible again and another consumer picks it up.

Tune this: too short and slow processors get duplicates; too long and crashed consumers wedge messages.

Push vs pull#

Pull (Kafka, SQS)      — consumer asks for next batch. Backpressure for free.
Push (RabbitMQ, NATS)  — broker pushes to consumer. Lower latency, harder to backpressure.

Kafka’s pull model is the reason it scales to 1M+ msg/sec — the consumer controls flow. RabbitMQ’s push is faster for low-volume work-queues but requires prefetch tuning to avoid overload.

Brokers under the hood#

Kafka’s secret: messages are an append-only log, written to OS page cache, fsync’d at configurable intervals. Consumers read sequentially with sendfile() — zero-copy from page cache to socket. This is why Kafka can saturate a 10 Gb NIC with a single broker.

Replication: each partition has a leader and N-1 followers. acks=all means the leader waits for in-sync followers before acking. Failover: a follower takes over within seconds via ZooKeeper / KRaft consensus.

Trade-offs#

Kafka (log-based) — replay any time, multiple consumer groups, massive throughput. Operationally heavier (ZooKeeper / KRaft, JVM tuning, partition planning). Best for event streams and analytics pipelines.

SQS / RabbitMQ (queue-based) — simpler operations, per-message ack, visibility timeouts. Throughput per-queue is lower; no built-in replay (SQS) or limited replay (RabbitMQ). Best for traditional work queues.

Other axes:

In-broker retention vs consume-and-delete — Kafka retains for days/weeks; SQS deletes on ack. Retention enables replay (priceless for bug recovery) but multiplies storage cost.
Single-consumer-per-message (queue) vs multi-consumer (pub-sub) — different semantics; both are useful. Kafka does both via consumer groups; RabbitMQ separates queues from exchanges.
At-least-once + idempotency vs transactional exactly-once — idempotency is more portable; transactional EOS works only within one broker ecosystem.

Real-world examples#

Kafka — LinkedIn invented it (2011), now ubiquitous. Discord runs Kafka with thousands of topics for fan-out. Uber processes trillions of events/day through Kafka.
AWS SQS — managed queue, two flavors: standard (at-least-once, best-effort order) and FIFO (exactly-once-ish, strict order, lower throughput).
RabbitMQ — AMQP broker, strong routing primitives (exchanges, bindings, headers). Still the workhorse in many enterprise shops.
Apache Pulsar — Kafka-class throughput with built-in geo-replication and tiered storage to S3. Used by Yahoo (its origin), Tencent, Splunk.
NATS / NATS JetStream — lightweight (single binary), at-most-once core + JetStream for at-least-once persistence.
Google Pub/Sub, Azure Service Bus — managed equivalents with regional / global semantics.

Pub-Sub — broadcast / multi-subscriber sibling; same brokers, different semantics.
Distributed Logging — log pipelines often run through Kafka or similar.
Distributed Task Scheduler — scheduled / delayed task execution layered on top of a queue.
Sequencer — used to assign idempotency keys to messages.