Design a Pub-Sub Service API
Topics, subscribers, fan-out, at-least-once vs exactly-once. The asynchronous backbone of every modern microservice mesh.
Context#
A pub-sub service is the asynchronous backbone of every microservice mesh built since 2010. AWS SNS+SQS, Google Cloud Pub/Sub, Azure Service Bus, NATS, Apache Pulsar, and (with caveats) Apache Kafka all expose variations on the same shape. One producer emits an event to a topic; N consumers receive it. The producer doesn’t know who the consumers are; the consumers don’t know who produced it. This decoupling is the whole point — services evolve independently, scale independently, fail independently.
What makes this a strong interview question is that the API surface looks small (publish, subscribe, pull) but the semantics are unusually rich:
- Delivery guarantees: at-most-once, at-least-once, exactly-once — each has a different API consequence.
- Push vs pull: a
delivery_urlon the subscription versus the consumer callingpullon a schedule. - Fan-out: 1 topic → N subscriptions; each subscription is an independent queue.
- Ordering: per-topic, per-key, or none — pick one and live with the trade.
- Dead-letter queues: where messages go when they can’t be processed after N retries.
- Backpressure: what happens when consumers can’t keep up.
The interviewer’s hidden objectives:
- Can you explain the difference between a topic and a subscription correctly? (Topics are stateless; subscriptions are queues.)
- Can you defend at-least-once as the default and explain why exactly-once is hard?
- Do you specify acknowledgement as a first-class operation, not a happy-path side effect?
- Can you sketch the fan-out model at scale — does each subscription get its own copy?
- Can you write the dead-letter and retry policy as an explicit config, not folklore?
Reference: Google Cloud Pub/Sub for the API shape; AWS SNS+SQS for the topic-fans-into-queues mental model; Kafka for ordering and partition guarantees.
Requirements (functional and non-functional)#
Functional — in scope:
- Create / delete topics.
- Publish a message (or a batch) to a topic.
- Create / delete subscriptions on a topic (push or pull mode).
- Pull messages from a pull-subscription.
- Acknowledge messages (so they’re removed from the queue).
- Configure dead-letter policy per subscription.
- Configure per-subscription filtering (by message attributes).
Functional — out of scope:
- Stream-replay / re-consumption — Kafka’s specialty; we are a queue-shaped pub-sub, not a log.
- Schema registry — assume the producer and consumer agree out-of-band.
- Encryption-at-rest — assume the underlying storage handles it.
- Multi-region replication — separate axis; v1 is single-region.
Non-functional:
- Latency: publish
<= 50 ms p95; end-to-end (publish → delivered)<= 1 s p95. - Throughput: 1 M messages/s aggregate; 10k messages/s per topic.
- Availability: 99.95% on the publish path (writes must not fail); 99.9% on the pull/push path.
- Delivery guarantee: at-least-once by default. Idempotent consumers achieve effectively-once.
- Durability: 11 nines once
publishreturns 200. - Maximum message size: 1 MB (larger payloads use
file-serviceand pass a reference). - Retention: up to 7 days unacknowledged; configurable.
Use case diagram#
┌──────────────┐ ┌──────────────┐ │ Producer │ │ Consumer │ └──────┬───────┘ └──────┬───────┘ │ │ ▼ ▼ [publish event] [pull messages] [batch publish] [acknowledge] [confirm receipt] [reject (nack)] │ │ └──────────────┬──────────────────┘ ▼ ┌───────────────────────────┐ │ Pub-Sub Service API │ └───────────────────────────┘ │ ┌──────────┴──────────┐ ▼ ▼ [Admin] [Async push] create/delete to webhook URL topic/subscriptionThree actors: producer, consumer, admin. Both producer and consumer talk only to the API; they never know each other.
Class diagram#
┌───────────────────────┐ │ PubSubService │ ├───────────────────────┤ │ createTopic(req) │ │ publish(t, msgs) │ │ createSubscription(r) │ │ pull(sub, max) │ │ ack(sub, ids) │ │ nack(sub, ids) │ └──────────┬────────────┘ │ owns ▼ ┌───────────────────────┐ ┌─────────────────────┐ │ Topic │ 1 ─── * │ Subscription │ ├───────────────────────┤ ├─────────────────────┤ │ name │ │ name │ │ created_at │ │ topic_name │ │ partition_count │ │ mode (push|pull) │ │ retention_seconds │ │ delivery_url? │ └──────────┬────────────┘ │ ack_deadline_sec │ │ publishes │ filter? │ ▼ │ dead_letter? │ ┌───────────────────────┐ └──────────┬──────────┘ │ Message │ │ enqueues into ├───────────────────────┤ ▼ │ id │ ┌─────────────────────┐ │ topic │ │ DeliveryAttempt │ │ ordering_key? │ ├─────────────────────┤ │ attributes (map) │ │ subscription │ │ data (bytes) │ │ message_id │ │ publish_time │ │ attempt_count │ └───────────────────────┘ │ lease_expires_at │ │ status │ └─────────────────────┘Topic is stateless: messages flow through but aren’t owned by it. Subscription is where state lives — each subscription is conceptually a queue, with its own backlog. A message arrives at a topic and is copied into the queue of every matching subscription. Within a subscription, DeliveryAttempt tracks per-message lease and retry state.
Sequence diagram (key flows)#
The publish + fan-out flow:
Producer PubSubAPI Topic Sub-A Sub-B │ POST /topics/orders:publish │ │ │ │──────────────────►│ │ │ │ │ │ validate│ │ │ │ │ persist │ │ │ │ │────────►│ │ │ │ │ matches subscription filters │ │ │ enqueue into Sub-A queue │ │ │─────────────────────────►│ │ │ │ enqueue into Sub-B queue │ │ │─────────────────────────────────────────►│ │ 200 + msg_id │ │ │ │ │◄──────────────────│ │ │ │The pull + ack flow (pull-mode subscription):
Consumer PubSubAPI SubscriptionQueue │ POST /subscriptions/{s}/pull?max=10 │──────────────────►│ │ │ │ fetch up to 10│ │ │ lease them for│ ack_deadline_seconds │ │──────────────►│ │ │ 10 messages │ │ │◄──────────────│ │ 10 messages + ack_ids │ │◄──────────────────│ │ │ │ │ ... consumer processes ... │ │ │ │ POST /subscriptions/{s}/ack │ │ ack_ids: [a, b, c, ...] │ │──────────────────►│ │ │ │ delete acked │ │ │──────────────►│ │ 204 No Content │ │ │◄──────────────────│ │The push flow inverts the direction — the service POSTs to the consumer’s delivery_url:
PubSubService Consumer (webhook) │ POST <delivery_url> │ │ body: {message, ack_id} │ │─────────────────────────────────────►│ │ │ process │ HTTP 200 OK ◄───────────────────────│ │ (200 == ack; non-2xx == nack) │The nack + retry + dead-letter flow:
ack fails after N attempts ┌─────────────────────────┐ │ Subscription queue │ └─────────┬───────────────┘ │ message exhausts retry policy ▼ ┌─────────────────────────┐ │ Dead-Letter Subscription│ └─────────┬───────────────┘ │ operator inspects, replays, or drops ▼ [out of scope]Activity diagram (for non-trivial state)#
DeliveryAttempt.status is the meaningful state machine — what does a message look like during its retry-loop life?
[message published] │ ▼ ┌──────────┐ │ Queued │ └────┬─────┘ │ pull / push ▼ ┌──────────┐ │ Leased │ (consumer holds it for └────┬─────┘ ack_deadline_seconds) │ ┌───────────┼──────────────────┐ ack │ │ lease │ nack / │ │ expires │ explicit nack ▼ ▼ ▼ ┌───────┐ ┌──────────┐ ┌──────────┐ │ Acked │ │ Re-Queued│ │ Re-Queued│ │(done) │ └────┬─────┘ └────┬─────┘ └───────┘ │ │ │ retry │ retry ▼ ▼ ┌──────────────────────────┐ │ attempt_count++ │ └────────────┬─────────────┘ │ max attempts │ exceeded ▼ ┌────────────────────┐ │ Dead-Letter │ │ Subscription │ └────────────────────┘The lease mechanism is the load-bearing invariant: when a consumer pulls a message, it gets a lease for ack_deadline_seconds (default 30 s). If the consumer crashes, the lease expires and the message is redelivered. This is why at-least-once is the default and the consumer must be idempotent. Exactly-once requires a deduplication store keyed on message ID — pushed to the consumer’s side, not the broker’s.
API implementation#
Endpoint catalogue#
| Method | Path | Purpose |
|---|---|---|
POST | /v1/topics | Create topic |
DELETE | /v1/topics/{name} | Delete topic |
POST | /v1/topics/{name}:publish | Publish 1..N messages |
POST | /v1/subscriptions | Create subscription (push or pull) |
DELETE | /v1/subscriptions/{name} | Delete subscription |
POST | /v1/subscriptions/{name}:pull | Pull up to max_messages (pull only) |
POST | /v1/subscriptions/{name}:ack | Acknowledge by ack_ids |
POST | /v1/subscriptions/{name}:nack | Negative-acknowledge (immediate redelivery) |
PATCH | /v1/subscriptions/{name} | Update push URL, deadline, filter, DLQ |
We use resource-action verbs (:publish, :pull, :ack) because the operations aren’t naturally CRUD. This matches GCP’s API style guide and pub-sub-specific verbs across the industry.
OpenAPI schema (excerpt)#
paths: /v1/topics/{name}:publish: post: operationId: publish parameters: - { name: name, in: path, required: true, schema: { type: string } } requestBody: required: true content: application/json: schema: type: object required: [messages] properties: messages: type: array maxItems: 1000 items: type: object required: [data] properties: data: { type: string, format: byte, maxLength: 1048576 } attributes: type: object additionalProperties: { type: string } ordering_key: type: string nullable: true maxLength: 1024 responses: '200': description: Published content: application/json: schema: type: object required: [message_ids] properties: message_ids: type: array items: { type: string } '413': { description: Batch too large } /v1/subscriptions/{name}:pull: post: operationId: pull parameters: - { name: name, in: path, required: true, schema: { type: string } } requestBody: required: false content: application/json: schema: type: object properties: max_messages: { type: integer, minimum: 1, maximum: 1000, default: 10 } return_immediately: { type: boolean, default: false } responses: '200': description: Up to max_messages messages content: application/json: schema: type: object properties: received_messages: type: array items: type: object required: [ack_id, message] properties: ack_id: { type: string } message: { $ref: '#/components/schemas/Message' } /v1/subscriptions/{name}:ack: post: operationId: acknowledge parameters: - { name: name, in: path, required: true, schema: { type: string } } requestBody: required: true content: application/json: schema: type: object required: [ack_ids] properties: ack_ids: type: array maxItems: 1000 items: { type: string } responses: '204': { description: Acked }components: schemas: Message: type: object required: [id, data, publish_time] properties: id: { type: string } data: { type: string, format: byte } attributes: type: object additionalProperties: { type: string } ordering_key: { type: string, nullable: true } publish_time: { type: string, format: date-time } Subscription: type: object required: [name, topic, mode, ack_deadline_seconds] properties: name: { type: string } topic: { type: string } mode: type: string enum: [push, pull] delivery_url: type: string format: uri nullable: true ack_deadline_seconds: { type: integer, minimum: 10, maximum: 600 } filter: { type: string, nullable: true } dead_letter: type: object nullable: true properties: topic: { type: string } max_delivery_attempts: { type: integer, minimum: 5, maximum: 100 }Push body — raw HTTP#
The push delivery webhook receives this body:
{ "message": { "id": "msg_01HZ...", "data": "eyJvcmRlcl9pZCI6IDEyMyB9", "attributes": { "event": "order.created", "region": "us-east-1" }, "publish_time": "2026-05-30T12:34:56Z", "ordering_key": "customer-789" }, "ack_id": "ack_proj-1234_sub-orders_abc"}A 2xx response from the consumer means ack; anything else means nack and re-deliver after the policy’s backoff.
Client samples — three languages#
The publish + pull + ack triad, in three languages.
import base64, json, time, requests
API = "https://api.example.com"TOKEN = "Bearer eyJhbGciOi..."
def publish(topic, messages): body = {"messages": [ {"data": base64.b64encode(json.dumps(m).encode()).decode(), "attributes": {"event": m.get("type", "unknown")}} for m in messages ]} return requests.post( f"{API}/v1/topics/{topic}:publish", json=body, headers={"Authorization": TOKEN}, timeout=2, ).json()
def pull_and_ack(sub, processor): r = requests.post( f"{API}/v1/subscriptions/{sub}:pull", json={"max_messages": 100}, headers={"Authorization": TOKEN}, timeout=10, ).json()
ack_ids = [] for env in r.get("received_messages", []): data = base64.b64decode(env["message"]["data"]) try: processor(json.loads(data)) ack_ids.append(env["ack_id"]) except Exception: pass # don't ack; will redeliver
if ack_ids: requests.post( f"{API}/v1/subscriptions/{sub}:ack", json={"ack_ids": ack_ids}, headers={"Authorization": TOKEN}, )
publish("orders", [{"type": "order.created", "order_id": 123}])pull_and_ack("orders-billing", lambda m: print("got", m))package main
import ( "bytes" "encoding/base64" "encoding/json" "fmt" "net/http")
const API = "https://api.example.com"const TOKEN = "Bearer eyJhbGciOi..."
type Envelope struct { AckID string `json:"ack_id"` Message struct { ID string `json:"id"` Data string `json:"data"` } `json:"message"`}
func publish(topic string, data []byte) error { body, _ := json.Marshal(map[string]any{ "messages": []map[string]any{ {"data": base64.StdEncoding.EncodeToString(data)}, }, }) req, _ := http.NewRequest("POST", fmt.Sprintf("%s/v1/topics/%s:publish", API, topic), bytes.NewReader(body)) req.Header.Set("Authorization", TOKEN) req.Header.Set("Content-Type", "application/json") _, err := http.DefaultClient.Do(req) return err}
func pullAndAck(sub string, handle func([]byte) error) error { pullBody, _ := json.Marshal(map[string]int{"max_messages": 100}) req, _ := http.NewRequest("POST", fmt.Sprintf("%s/v1/subscriptions/%s:pull", API, sub), bytes.NewReader(pullBody)) req.Header.Set("Authorization", TOKEN) req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req) if err != nil { return err } defer resp.Body.Close()
var pr struct{ ReceivedMessages []Envelope `json:"received_messages"` } json.NewDecoder(resp.Body).Decode(&pr)
var ackIDs []string for _, env := range pr.ReceivedMessages { data, _ := base64.StdEncoding.DecodeString(env.Message.Data) if handle(data) == nil { ackIDs = append(ackIDs, env.AckID) } } if len(ackIDs) == 0 { return nil } ackBody, _ := json.Marshal(map[string]any{"ack_ids": ackIDs}) ar, _ := http.NewRequest("POST", fmt.Sprintf("%s/v1/subscriptions/%s:ack", API, sub), bytes.NewReader(ackBody)) ar.Header.Set("Authorization", TOKEN) ar.Header.Set("Content-Type", "application/json") _, err = http.DefaultClient.Do(ar) return err}
func main() { publish("orders", []byte(`{"order_id": 123}`)) pullAndAck("orders-billing", func(b []byte) error { fmt.Println("got", string(b)) return nil })}const API = "https://api.example.com";const TOKEN = "Bearer eyJhbGciOi...";
async function publish(topic, messages) { const body = { messages: messages.map((m) => ({ data: Buffer.from(JSON.stringify(m)).toString("base64"), attributes: { event: m.type ?? "unknown" }, })), }; const r = await fetch(`${API}/v1/topics/${topic}:publish`, { method: "POST", headers: { Authorization: TOKEN, "Content-Type": "application/json" }, body: JSON.stringify(body), }); return r.json();}
async function pullAndAck(sub, handle) { const r = await fetch(`${API}/v1/subscriptions/${sub}:pull`, { method: "POST", headers: { Authorization: TOKEN, "Content-Type": "application/json" }, body: JSON.stringify({ max_messages: 100 }), }); const { received_messages = [] } = await r.json();
const ackIds = []; for (const env of received_messages) { const data = JSON.parse(Buffer.from(env.message.data, "base64").toString()); try { await handle(data); ackIds.push(env.ack_id); } catch { // don't ack; will redeliver } } if (!ackIds.length) return; await fetch(`${API}/v1/subscriptions/${sub}:ack`, { method: "POST", headers: { Authorization: TOKEN, "Content-Type": "application/json" }, body: JSON.stringify({ ack_ids: ackIds }), });}
await publish("orders", [{ type: "order.created", order_id: 123 }]);await pullAndAck("orders-billing", (m) => console.log("got", m));Latency budget#
| Phase | Budget | Notes |
|---|---|---|
| Auth + validate | 5 ms | Cached JWT |
| Persist to topic log | 20 ms p95 | Quorum write to 3 replicas |
| Fan-out to subscriptions | 15 ms | Async; respond to publisher first |
| Publish response | 50 ms p95 | Total budget |
| Sub queue → pull/push | 1 s p95 end-to-end | Includes fan-out latency |
| Ack write | 10 ms | Single quorum write |
The publish budget is tight because producers block on it in the synchronous path. The end-to-end (publish to delivery) budget is looser because consumers tolerate it.
Trade-offs and extensions#
| Decision | Why | Cost if requirements change |
|---|---|---|
| At-least-once default | Achievable with simple infra | Consumers must be idempotent |
| Per-subscription queue | Independent consumer pace | More storage; 1 message becomes N copies |
| Push and pull both supported | Different consumer ergonomics | Two delivery paths to operate |
| 1 MB max message size | Tractable disk + network | Larger payloads via file-service reference |
| 7-day retention | Bounds storage; matches GCP/SNS | Compliance use cases want 30+ days |
| Resource-action verbs | Fits non-CRUD semantics | Less RESTful; mild style mismatch |
| Lease + nack model | Crash-safe redelivery | Slow consumers can extend lease or fall over |
A clean contrast on the delivery-mode axis:
Push mode
- Service POSTs to webhook
- Low operational burden for consumer
- Tight feedback loop on consumer failures
- Consumer must be publicly reachable
- Throughput limited by webhook 2xx rate
Pull mode
- Consumer calls
:pullon a schedule - Backpressure is consumer-controlled
- Long-poll keeps latency low
- Behind-NAT consumers work fine
- Consumer owns scheduling logic
Likely follow-up extensions:
- Ordering keys. Messages sharing an
ordering_keyare delivered in publish order to the same consumer. This requires per-key affinity at the queue level; throughput per key is bounded. - Filtering. Per-subscription filter expression on message attributes (e.g.
event=order.* AND region=us-*). Cheap to add at fan-out time; saves consumer bandwidth. - Replay. Subscription becomes time-windowed; admin can rewind to T or to a
message_id. Pushes the model towards Kafka and complicates retention. - Cross-region replication. Topics replicate asynchronously; subscriptions are per-region. Aurora-style topology.
- Schema registry. Producer declares Protobuf/Avro/JSON-Schema per topic; service validates at publish time.
Mock interview follow-ups#
- “What’s the difference between Pub/Sub and Kafka?” — Pub/Sub is a queue with topic-fanned-into-queues semantics; once consumed, messages are gone. Kafka is a distributed log; messages persist for the retention window regardless of consumption. Different abstractions, different operational profiles.
- “How do you handle a slow consumer?” — Lease-renew API for long-running work; per-subscription backlog metrics; eventually the backlog policy triggers either dead-lettering (drop oldest) or backpressure (publishers see 429).
- “What if a consumer crashes mid-processing?” — The lease expires; the message is redelivered to another instance. The consumer must be idempotent, keyed on
message.id. - “How do you achieve exactly-once?” — At-least-once delivery plus consumer-side dedup keyed on
message.idagainst a persistent store. Some platforms (Kafka, GCP Pub/Sub) offer transactional helpers, but the durable dedup is still the load-bearing primitive. - “What about ordering?” — Off by default. Opt-in per topic;
ordering_keydirects same-key messages to the same partition and the same consumer instance. Throughput per key is bounded by single-threaded consumer. - “Why not just use a database table as a queue?” — Polling latency, fairness, contention on
UPDATE ... SET locked=true. Production-grade pub-sub is a different optimisation target. - “How is this different from SQS?” — SQS is the per-subscription queue without the multi-subscriber topic; you compose SNS+SQS to get our shape. Our API is closer to GCP Pub/Sub, which presents topic and subscription as first-class resources.
Related#
- Design a Search Service API — comments and document updates flow through pub-sub on the way to the search index.
- Design a Comment Service API — emits comment-create events onto the pub-sub backbone.
- Design a File Service API — emits file-uploaded events for async scanning and indexing.
- Event-Driven Architecture Protocols — the building-block writeup on the protocols underneath this API.
- The API-Design Walk-through — the seven-step recipe this writeup followed.