Event-Driven Architecture Protocols — API Design

What it is#

An event-driven architecture inverts the request/response contract. Instead of the client asking and the server answering, the server emits an event when something happens and any interested party can consume it. There is no single wire-level protocol; there is a family of them, each with different durability, ordering, and delivery guarantees.

Four protocols cover most of the territory:

Webhooks — the server makes an HTTP POST to a URL the consumer registered. One-way, request/response-shaped, fire-and-forget-ish.
Server-Sent Events (SSE) — the consumer holds open an HTTP GET, the server streams events as text/event-stream. One-way, server → client, long-lived.
Apache Kafka (and similar log-shaped brokers) — a durable, ordered, replayable log of events. Consumers track their own position; events are kept for days.
Message queues (RabbitMQ, AWS SQS, Google Cloud Pub/Sub) — a broker holds messages until a consumer acks. No replay; once acked, it’s gone.

Each has a different contract. Webhook delivery is at-least-once with retry; SSE is best-effort with reconnection on Last-Event-ID; Kafka is durable and ordered per-partition; SQS is durable but unordered (FIFO queues exist but at cost). The choice between them is one of the most consequential in an API design.

When to use it#

Reach for event-driven protocols when:

The producer doesn’t know who’s listening. Stripe doesn’t know which webhook URL on which merchant will be hit; a Kafka producer doesn’t know which consumer group will read the topic next week. Decoupling producers and consumers is the canonical event-driven win.
The consumer wants liveness, not polls. A chat app polling /messages every second is wasting cycles. SSE or WebSocket pushes when there’s news, sleeps when there isn’t.
Multiple consumers want the same event. A new-order event triggers fulfilment, billing, analytics, and a customer notification — all separate consumers. A pub/sub broker fans this out for free.
You want replay. Kafka keeps events for a week (or a year if you want); a new consumer can replay history. Webhook and SSE typically can’t.

Avoid event-driven protocols when:

You actually want a synchronous reply. Request/response is simpler; if the caller needs the answer immediately, don’t bounce through a queue.
Strong consistency is required. Eventual consistency is the event-driven default. A bank transfer that has to debit then credit atomically is a transaction, not two events.
Operational simplicity matters more than throughput. Kafka is a system to operate. RabbitMQ is a system to operate. If you can solve it with a POST and a retry, that’s cheaper.

How it works#

Webhooks — HTTP `POST` from server to consumer#

The simplest event-driven protocol. The consumer registers a URL; the server POSTs to it when an event fires.

POST /webhooks/stripe HTTP/1.1
Host: api.merchant.com
Content-Type: application/json
Stripe-Signature: t=1717075200,v1=hexsignature...
User-Agent: Stripe/1.0

{
  "id": "evt_1NkXyzABC",
  "type": "payment_intent.succeeded",
  "created": 1717075200,
  "data": { "object": { "id": "pi_3NkXyz", "amount": 4999, ... } }
}

The delivery contract is at-least-once with retry on non-2xx. Stripe retries failed webhooks with exponential backoff over 3 days. The consumer must:

Respond with a 2xx promptly (Stripe’s deadline is 30 seconds, GitHub’s is 10 seconds; whatever the producer’s docs say).
Verify the signature (typically HMAC-SHA256 over the raw body, often prefixed with a timestamp, keyed by a secret shared between producer and consumer at registration). Without this, anyone can POST to the public webhook URL.
Deduplicate by event ID. Retries arrive with the same id. Storing seen IDs (Redis with TTL = the producer’s retry window) avoids double-processing.
Handle out-of-order delivery. Webhooks have no per-recipient ordering guarantee; payment_intent.created and payment_intent.succeeded can arrive in either order. Idempotent state machines, not sequential processing.
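
One way to tolerate reordering is to rank the states and only ever move forward. The sketch below assumes a hypothetical ranking of three event types and a plain dict as the state store; `RANK` and `apply_event` are illustrative names, not part of any provider's SDK.

```python
# Hypothetical lifecycle ranking: a higher number means further along.
RANK = {
    "payment_intent.created": 1,
    "payment_intent.processing": 2,
    "payment_intent.succeeded": 3,
}

def apply_event(state: dict, event: dict) -> dict:
    """Advance state[object_id] only if the event moves it forward.

    A late 'created' arriving after 'succeeded' is ignored, and a repeated
    event is a no-op, so any arrival order converges to the same state.
    """
    rank = RANK.get(event["type"])
    if rank is None:
        return state  # unknown event type: leave the state untouched
    object_id = event["data"]["object"]["id"]
    current = state.get(object_id)
    if current is None or rank > RANK[current]:
        state[object_id] = event["type"]
    return state
```

Applying `created` then `succeeded`, or `succeeded` then `created`, leaves the same final state, which is the property the consumer needs.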

Slack, GitHub, Twilio, Auth0, and every payments processor ship webhooks. The pattern is universal; the details (signature scheme, retry window, ordering claim) vary per provider.
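
Because the retry window differs per provider, the dedupe store is easiest to reason about when the window is an explicit parameter. A minimal in-memory sketch follows; `SeenEvents` is a hypothetical helper standing in for a Redis key with a TTL (in Redis the same check is commonly a single `SET key value NX EX ttl`).

```python
import time

class SeenEvents:
    """In-memory stand-in for a Redis key with a TTL (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expiry = {}  # event id -> time after which it may be forgotten

    def first_time(self, event_id: str) -> bool:
        """Return True once per event id within the TTL, False for repeats."""
        now = self.clock()
        # Drop expired ids so the store does not grow without bound.
        # This full scan is fine for a sketch, not for high volume.
        self._expiry = {k: t for k, t in self._expiry.items() if t > now}
        if event_id in self._expiry:
            return False
        self._expiry[event_id] = now + self.ttl
        return True
```

A consumer of a provider that retries for three days might construct it as `SeenEvents(ttl_seconds=3 * 24 * 3600)` and skip processing whenever `first_time(event["id"])` returns False.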

Server-Sent Events — one-way streaming over HTTP#

SSE is the lighter-weight cousin of WebSocket: one-way (server → client), HTTP-native, text-only. The client opens a GET with Accept: text/event-stream; the server holds the connection open and writes events.

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

id: 14721
event: message
data: {"user": "alice", "text": "hello"}

id: 14722
event: typing
data: {"user": "bob"}

id: 14723
event: message
data: {"user": "bob", "text": "hi"}

Each event is id:, event:, data: lines followed by a blank line. The browser’s EventSource API handles reconnection automatically — on a dropped connection, it reconnects and sends the last seen id as the Last-Event-ID header, letting the server resume.

GET /events HTTP/1.1
Accept: text/event-stream
Last-Event-ID: 14723
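
The server side of that resume step can be sketched as follows, assuming the server keeps a bounded in-memory history of recent events with ascending integer ids. `format_sse` and `resume_stream` are hypothetical names, and replaying the whole buffer on a first connect is a design choice rather than a requirement.

```python
import json

def format_sse(event_id: int, event: str, data: dict) -> str:
    """Serialize one event in text/event-stream framing.

    json.dumps emits a single line by default, so one data: field is enough;
    the trailing blank line terminates the event.
    """
    return f"id: {event_id}\nevent: {event}\ndata: {json.dumps(data)}\n\n"

def resume_stream(history: list, last_event_id):
    """Yield frames for every buffered event newer than last_event_id.

    history is a list of (id, event, data) tuples in ascending id order;
    last_event_id is the raw Last-Event-ID header value, or None.
    """
    try:
        cursor = int(last_event_id) if last_event_id is not None else -1
    except ValueError:
        cursor = -1  # unparseable header: treat as a fresh connection
    for event_id, event, data in history:
        if event_id > cursor:
            yield format_sse(event_id, event, data)
```

A web framework's streaming response would iterate `resume_stream(history, request_headers.get("Last-Event-ID"))` and then continue with live events.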

SSE is the right tool for: LLM token streaming (OpenAI, Anthropic), live notifications, dashboards, chat read receipts (one-way is enough), public-event tails (GitHub events). Not for: full-duplex chat, binary frames, anything requiring client-to-server pushes (use WebSocket).

Apache Kafka — the durable log#

Kafka is a different model: a distributed, replicated, append-only log with partitioned topics. Producers write to a topic; consumers read at their own pace and track their own offset.

   Producer            Topic "orders" (partitions p0, p1, p2)
       │
       │ produce(key=order_42, value=...)
       │─────────────────────────────────►│ p1: [..., 1023, 1024, 1025, NEW]
       │                                  │ p0: [..., 487, 488]
       │                                  │ p2: [..., 612, 613, 614]
                                          │
   Consumer group "fulfilment"            │
       │  consume(p1, offset=1023)        │
       │─────────────────────────────────◄│
       │  consume(p1, offset=1024)        │
       │─────────────────────────────────◄│
       │  ... commits offset back         │
       │                                  │
   Consumer group "analytics"             │
       │  reads from offset=0 (replay)    │
       │─────────────────────────────────◄│

Key contracts:

Ordering is per-partition, not per-topic. Messages with the same key (e.g. order_42) go to the same partition and are read in order. Messages with different keys can arrive in any order across the topic.
Durability is replicated. Each partition is replicated to N brokers; a write isn’t acked until M replicas have it (configurable).
Replay is built-in. A new consumer can start from offset 0 and re-read everything still in retention (7 days default, configurable up to forever).
Delivery is at-least-once by default, exactly-once with transactions. The exactly-once mode (idempotent producer plus transactions, with consumers reading at isolation.level=read_committed) has overhead and covers only reads and writes inside Kafka; most teams stay at at-least-once and dedupe at the consumer.
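
The ordering and replay contracts can be illustrated with a toy in-memory model. This is not a Kafka client: `PartitionedLog` is a hypothetical class, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner applies to keys.

```python
import zlib
from collections import defaultdict

class PartitionedLog:
    """Toy in-memory model of a partitioned topic; not a Kafka client."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]
        # (group, partition) -> next offset to read; each group has its own cursor.
        self.committed = defaultdict(int)

    def partition_for(self, key: str) -> int:
        # Stand-in hash: same key always maps to the same partition.
        return zlib.crc32(key.encode()) % len(self.partitions)

    def produce(self, key: str, value: str) -> tuple:
        """Append to the key's partition and return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def poll(self, group: str, partition: int) -> list:
        """Return unread records for this group and commit the new offset."""
        start = self.committed[(group, partition)]
        records = self.partitions[partition][start:]
        self.committed[(group, partition)] = start + len(records)
        return records
```

Two writes with key `order_42` land in one partition at consecutive offsets, and a second consumer group polling the same partition later still sees both records, because reading never removes anything from the log.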

Kafka is the right tool when: events are the source of truth (event sourcing), replay is needed, multiple consumer groups will read the same topic, throughput is high (>10k events/sec). Wrong tool when: small scale, simple notification, no replay need (use a queue or webhook instead).

Message queues — RabbitMQ, SQS, Cloud Pub/Sub#

Queues are simpler than logs: a broker holds a message until a consumer acks. Once acked, it’s gone.

   Producer       Queue "fulfilment"     Consumer
       │ send(msg)    [msg1, msg2, msg3]      │
       │─────────────►│                       │
       │              │  deliver(msg1)        │
       │              │──────────────────────►│
       │              │                       │ process(msg1)
       │              │  ack(msg1)            │
       │              │◄──────────────────────│
       │              │  msg1 removed         │
       │              │  [msg2, msg3]         │

SQS, RabbitMQ, and Cloud Pub/Sub all expose this shape with variations:

Visibility timeout: when a message is delivered, it’s hidden for N seconds. If the consumer acks within N, it’s removed; if not, it reappears for redelivery. (This is SQS’s term; Cloud Pub/Sub calls the same idea an ack deadline.)
Dead-letter queue (DLQ): after K redeliveries, the message moves to a separate queue for inspection. Prevents poison-message loops.
Fan-out: SNS topics + SQS queues (or RabbitMQ exchanges + queues) let one publish hit multiple downstream queues, one per consumer.
FIFO mode: strict ordering (per message group in SQS) with deduplication, at lower throughput. SQS FIFO queues default to 300 API calls/sec per action (3,000 messages/sec with batching), with a high-throughput mode that raises the limit; standard queues have nearly unlimited throughput.
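
The visibility timeout and DLQ interact in a way that is easier to see in code. Below is an in-memory sketch under simplifying assumptions (single process, injected clock); `ToyQueue` and its parameter names are hypothetical and do not mirror any broker's API.

```python
import time

class ToyQueue:
    """In-memory sketch of receive/ack with a visibility timeout and a DLQ."""

    def __init__(self, visibility_timeout: float, max_receives: int,
                 clock=time.monotonic):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.clock = clock
        self.messages = {}      # id -> {"body", "visible_at", "receives"}
        self.dead_letters = []  # bodies that exceeded max_receives
        self._next_id = 0

    def send(self, body: str) -> int:
        self._next_id += 1
        self.messages[self._next_id] = {"body": body, "visible_at": 0.0, "receives": 0}
        return self._next_id

    def receive(self):
        """Return (id, body) for the first visible message, or None."""
        now = self.clock()
        for msg_id, msg in list(self.messages.items()):
            if msg["visible_at"] > now:
                continue  # still hidden: a consumer is presumably working on it
            if msg["receives"] >= self.max_receives:
                # Redelivered too often: park it instead of looping forever.
                self.dead_letters.append(self.messages.pop(msg_id)["body"])
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg_id, msg["body"]
        return None

    def ack(self, msg_id: int) -> None:
        """Delete the message; an unacked message reappears after the timeout."""
        self.messages.pop(msg_id, None)
```

A message that is received but never acked becomes visible again after the timeout, and after `max_receives` attempts it lands in `dead_letters` rather than being handed out again.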

Queues are right for: work queues (process this asynchronously), email/SMS dispatch, image processing, background jobs. Wrong for: event replay, multiple consumer groups reading the same stream (Kafka is the right tool there).

Consuming a webhook — three languages#

Verifying a Stripe-style HMAC-signed webhook in Python, Go, and Node:

import hashlib
import hmac
import time

from flask import Flask, abort, request

app = Flask(__name__)
WEBHOOK_SECRET = "whsec_..."  # signing secret issued at registration
TOLERANCE_SECONDS = 300       # reject old timestamps to limit replay
seen = {}  # in production use Redis

def verify_signature(body: bytes, header: str) -> None:
    # Header shape: "t=<unix seconds>,v1=<hex HMAC>"
    try:
        parts = dict(p.split("=", 1) for p in header.split(","))
        timestamp = int(parts["t"])
        received = parts["v1"]
    except (KeyError, ValueError):
        abort(400, "malformed signature header")
    if abs(time.time() - timestamp) > TOLERANCE_SECONDS:
        abort(400, "stale signature")
    # The signed payload is "<timestamp>." followed by the raw request bytes.
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        f"{timestamp}.".encode() + body,
        hashlib.sha256,
    ).hexdigest()
    # Compare as bytes in constant time; str inputs must be ASCII-only.
    if not hmac.compare_digest(expected.encode(), received.encode()):
        abort(400, "bad signature")

@app.post("/webhooks/stripe")
def webhook():
    body = request.get_data()  # raw bytes: verify before parsing
    verify_signature(body, request.headers.get("Stripe-Signature", ""))
    event = request.get_json()
    if event["id"] in seen:
        return "", 200  # dedupe
    seen[event["id"]] = True
    # process the event idempotently
    return "", 200

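package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "io"
    "log"
    "net/http"
    "strconv"
    "strings"
    "sync"
    "time"
)

const webhookSecret = "whsec_..."
const tolerance = 300 * time.Second

var (
    mu   sync.Mutex          // handlers run concurrently; guard the map
    seen = map[string]bool{} // in production use Redis
)

func verifySignature(body []byte, header string) bool {
    parts := map[string]string{}
    for _, p := range strings.Split(header, ",") {
        kv := strings.SplitN(p, "=", 2)
        if len(kv) == 2 {
            parts[kv[0]] = kv[1]
        }
    }
    // "t" is Unix seconds, not an RFC 3339 string.
    sec, err := strconv.ParseInt(parts["t"], 10, 64)
    if err != nil {
        return false
    }
    if age := time.Since(time.Unix(sec, 0)); age > tolerance || age < -tolerance {
        return false
    }
    // The signed payload is "<t>." followed by the raw body bytes.
    mac := hmac.New(sha256.New, []byte(webhookSecret))
    io.WriteString(mac, parts["t"]+".")
    mac.Write(body)
    expected := hex.EncodeToString(mac.Sum(nil))
    return hmac.Equal([]byte(expected), []byte(parts["v1"]))
}

func webhook(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(r.Body)
    if err != nil || !verifySignature(body, r.Header.Get("Stripe-Signature")) {
        http.Error(w, "bad sig", http.StatusBadRequest)
        return
    }
    var event struct {
        ID string `json:"id"`
    }
    if err := json.Unmarshal(body, &event); err != nil {
        http.Error(w, "bad json", http.StatusBadRequest)
        return
    }
    mu.Lock()
    duplicate := seen[event.ID]
    seen[event.ID] = true
    mu.Unlock()
    if duplicate {
        w.WriteHeader(http.StatusOK) // dedupe
        return
    }
    // process idempotently
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/webhooks/stripe", webhook)
    log.Fatal(http.ListenAndServe(":8080", nil)) // port is arbitrary
}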

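import express from "express";
import crypto from "crypto";

const app = express();
const WEBHOOK_SECRET = "whsec_...";
const TOLERANCE_SECONDS = 300;
const seen = new Set(); // in production use Redis

// rawBody is a Buffer; header looks like "t=<unix seconds>,v1=<hex HMAC>"
function verifySignature(rawBody, header) {
  if (!Buffer.isBuffer(rawBody) || typeof header !== "string") return false;
  const parts = Object.fromEntries(
    header.split(",").map((p) => {
      const i = p.indexOf("=");
      return [p.slice(0, i), p.slice(i + 1)];
    })
  );
  const ts = Number(parts.t);
  if (!Number.isFinite(ts) || !parts.v1) return false;
  if (Math.abs(Date.now() / 1000 - ts) > TOLERANCE_SECONDS) return false;
  const expected = crypto
    .createHmac("sha256", WEBHOOK_SECRET)
    .update(`${parts.t}.`)
    .update(rawBody) // sign the exact bytes received
    .digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(parts.v1);
  // timingSafeEqual throws on unequal lengths, so check length first
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }), // keep the body as raw bytes
  (req, res) => {
    if (!verifySignature(req.body, req.headers["stripe-signature"])) {
      return res.status(400).send("bad sig");
    }
    const event = JSON.parse(req.body.toString("utf8"));
    if (seen.has(event.id)) return res.status(200).end();
    seen.add(event.id);
    // process idempotently
    res.status(200).end();
  }
);

app.listen(3000); // port is arbitrary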
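
To exercise any of these handlers locally, it helps to produce a correctly signed request. The sketch below builds a header in the same `t=...,v1=...` shape the verifiers expect; `sign_payload` and the `whsec_test` secret are hypothetical, and the receiver must be configured with the same secret.

```python
import hashlib
import hmac
import json
import time

def sign_payload(secret: str, body: bytes, timestamp: int) -> str:
    """Build a 't=...,v1=...' header value over '<timestamp>.<body>'."""
    signed = f"{timestamp}.".encode() + body
    digest = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"

# Example payload; send `body` unchanged, since re-serializing it
# would alter the bytes and invalidate the signature.
body = json.dumps({"id": "evt_test_1", "type": "payment_intent.succeeded"}).encode()
header = sign_payload("whsec_test", body, int(time.time()))
```

Posting `body` with `Content-Type: application/json` and the generated value in the `Stripe-Signature` header should pass verification; changing a single byte of the body should yield a 400.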

Variants#

Protocol	Direction	Durability	Ordering	Replay	Best for
Webhook	Server → consumer (HTTP POST)	Producer retries 1-3 days	None	None	External integrations, partner notifications
SSE	Server → client (long-lived GET)	None (best-effort)	In-order on one stream	Via `Last-Event-ID`	Live dashboards, LLM streaming, chat read receipts
WebSocket	Bidirectional	None	In-order	None	Real-time bidirectional (chat, games)
Kafka	Producer → log → consumers	Replicated, retention 7d–∞	Per-partition	Yes (offset-based)	Event sourcing, multi-consumer fan-out, replay
Message queue (SQS, RabbitMQ)	Producer → queue → one consumer	Until acked	None (FIFO mode = yes)	None	Work queues, async jobs
MQTT	Pub/sub over TCP, low overhead	Optional	Per-topic	None	IoT, mobile (low bandwidth)

Trade-offs#

What event-driven protocols give you:

Decoupling. Producers and consumers evolve independently.
Liveness without polling. Push is more efficient than pull when the event rate is low and the latency target is tight.
Fan-out. One event, many consumers, each at its own pace.
Replay (Kafka specifically). Reprocess history when a bug forces it.

What event-driven protocols cost you:

Operational complexity. A Kafka cluster is a system to operate; a webhook receiver is a system to monitor.
Debugging is harder. No per-event HTTP log line; you trace across producer, broker, and consumer.
Ordering and consistency are weaker by default. Most event systems are at-least-once; consumers must be idempotent. Strong-ordering modes exist but at cost.
Backpressure is the producer’s problem to think about. A slow consumer with no bound on queue depth and no lag alert is a queue that grows until the broker tips over.
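
The simplest form of backpressure is a bounded buffer that tells the producer when it is full. A minimal sketch using the standard library's `queue.Queue`; `offer` is a hypothetical helper, and what to do on a False result (slow down, shed load, spill to disk) is left to the caller.

```python
import queue

def offer(buffer: queue.Queue, event: dict) -> bool:
    """Try to enqueue without blocking; report False when the buffer is full."""
    try:
        buffer.put_nowait(event)
        return True
    except queue.Full:
        return False
```

With `queue.Queue(maxsize=2)`, the third `offer` returns False until a consumer takes an item, which turns unbounded growth into an explicit signal.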

Common pitfalls#

No signature verification on webhooks. Public URL + no HMAC = anyone can fake events. Stripe, GitHub, Twilio all ship HMAC; use it.
Not deduplicating by event ID. Retries are guaranteed; non-idempotent consumers double-charge / double-email / double-ship.
Treating webhook delivery as synchronous. Respond with 202 Accepted and process asynchronously. A slow consumer chain causes the producer to mark the webhook as failed and retry the whole train.
Assuming webhook order. Events are reordered in flight; build state machines that tolerate any order.
Using Kafka for a simple notification. A 3-broker cluster to deliver one email-sent event is over-investing. A queue or a webhook is fine.
Using a queue for event sourcing. Once acked, the message is gone. If you want history, you want a log (Kafka), not a queue.
No DLQ. Poison messages loop forever, consuming worker capacity. Every queue needs a DLQ and an alert on it.
SSE through a CDN that buffers. Some CDNs buffer responses for compression; the stream arrives in chunks of N. Disable buffering on the SSE path (X-Accel-Buffering: no for nginx).
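
The "acknowledge first, process later" advice above can be sketched with a worker thread and an in-process queue. This is an assumption-laden illustration: `handle_delivery` and `worker` are hypothetical names, and an in-memory queue loses events on a crash, so a durable queue would take its place in production.

```python
import queue
import threading

jobs = queue.Queue()
processed = []  # stands in for the side effects of real processing

def handle_delivery(event: dict) -> int:
    """Do only cheap work inline, enqueue the rest, and return the status code."""
    jobs.put(event)
    return 202  # acknowledged; real processing happens in the worker

def worker() -> None:
    while True:
        event = jobs.get()
        try:
            processed.append(event["id"])  # placeholder for slow business logic
        finally:
            jobs.task_done()

# Daemon thread so the sketch does not keep the process alive on exit.
threading.Thread(target=worker, daemon=True).start()
```

The handler returns as soon as the event is queued, so the producer's delivery deadline is decoupled from how long the business logic takes.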

WebSockets — Bidirectional Streaming — bidirectional version of SSE; same upgrade-handshake idea, full-duplex frames.
Data Fetching Patterns — push vs pull, paginated vs streamed; the policy that picks event-driven over polling.
Design a Pub-Sub Service API — designing a Kafka-shaped or SQS-shaped pub/sub API from scratch.
Managing Retries — webhook retry policies and DLQ design; the same exponential-backoff playbook.
The Role of Idempotency in API Design — at-least-once delivery requires idempotent consumers; same key-store pattern.