ChatGPT-style Conversational System — System Design

Step 1 — Clarify Requirements

Functional

A user sends a prompt; the system streams tokens back as they’re generated.
Multi-turn conversations: each turn’s context includes the prior messages.
Multiple model sizes (small / large / frontier) selectable per request.
Safety filters on both input and output.
Out of scope here: model training; fine-tuning infrastructure; multimodal (image/audio) input.

Non-functional

99.9% availability.
First-token latency (time-to-first-token) under 1 second p95.
Inter-token latency under 50 ms (so generation feels fluid at 20+ tokens/sec).
Throughput: on the order of ten million tokens generated per minute at peak across the fleet (~150 K tokens/sec, per the Step 2 estimate).
Multi-tenant: a free-tier user shouldn’t block a paying user’s request.

Step 2 — Capacity Estimation

MAU: ~200 M (ChatGPT-scale, based on publicly disclosed figures).
Concurrent load at peak: ~50 K conversations generating at once.
Tokens generated/sec: 50 K active requests × 100 tokens avg ÷ ~30 s generation time ≈ 167 K, rounded to ~150 K tokens/sec aggregate.
GPU fleet: 150 K tokens/sec aggregate ÷ ~50 tokens/sec per frontier-model instance = ~3,000 instances. At ~$30K/year per H100 (capex amortized), and assuming one GPU per instance, that is ~$90M/year just for the frontier-model serving fleet, which explains why providers price hard against inference cost.
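Context length per request: typically 2-8K tokens, up to ~200K. Each token in the KV cache takes ~2 (K and V) × hidden_dim × num_layers × 2 bytes (16-bit values). The ~80 KB per token used here corresponds to hidden_dim × num_layers ≈ 20 K; models with grouped-query attention substitute their smaller K/V width for hidden_dim, and full multi-head attention at frontier dimensions costs more. A 100K-token context: ~8 GB of KV cache per active request.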
Storage (conversation history): ~1 KB per message × 100 messages/user × 200 M users = 20 TB (modest).

The dominant resource is GPU memory and compute. Storage and CPU are afterthoughts.
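The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that only restates the figures from this step; the constant names are illustrative, and the one-GPU-per-instance cost assumption is made explicit in a comment.

```python
# Back-of-envelope capacity model; every constant is one of the estimates above.
ACTIVE_CONVERSATIONS = 50_000      # concurrent generations at peak
TOKENS_PER_RESPONSE = 100          # average output length
GENERATION_SECONDS = 30            # average time to produce one response
TOKENS_PER_SEC_PER_INSTANCE = 50   # frontier-model decode rate per instance
COST_PER_GPU_YEAR_USD = 30_000     # amortized H100 cost
KV_BYTES_PER_TOKEN = 80_000        # ~80 KB of KV cache per token
BYTES_PER_MESSAGE = 1_000          # ~1 KB per stored message
MESSAGES_PER_USER = 100
USERS = 200_000_000

raw_tokens_per_sec = ACTIVE_CONVERSATIONS * TOKENS_PER_RESPONSE / GENERATION_SECONDS
planning_tokens_per_sec = 150_000  # the rounded figure the estimate carries forward

instances = planning_tokens_per_sec // TOKENS_PER_SEC_PER_INSTANCE
# Assumption: one GPU per instance; multiply by GPUs-per-instance otherwise.
fleet_cost_usd = instances * COST_PER_GPU_YEAR_USD

kv_gb_100k_context = 100_000 * KV_BYTES_PER_TOKEN / 1e9
history_tb = BYTES_PER_MESSAGE * MESSAGES_PER_USER * USERS / 1e12

print(f"raw tokens/sec:        {raw_tokens_per_sec:,.0f}")
print(f"instances needed:      {instances:,}")
print(f"fleet cost (USD/year): {fleet_cost_usd:,}")
print(f"KV cache, 100K tokens: {kv_gb_100k_context:.1f} GB")
print(f"conversation history:  {history_tb:.1f} TB")
```

Keeping the inputs as named constants makes it easy to see how sensitive the fleet size is to the per-instance decode rate, which is the least certain number in the list.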

Step 3 — System Interface

POST /v1/chat/completions
  Body: {
    model: "gpt-x-large" | ...,
    messages: [{role, content}, ...],
    max_tokens: int,
    stream: bool,
    temperature: float,
    ...
  }

  If stream=false: synchronous response with full output.
  If stream=true: Server-Sent Events with one token / chunk per event.

POST /v1/conversations         (server-side state, returns conversation_id)
POST /v1/conversations/:id/messages
GET  /v1/models

The streaming protocol is the user experience — first-token latency is more important than total latency because users start reading the moment text appears.

Step 4 — High-Level Design

                                    safety pre-filter (input toxicity, prompt-injection signals)
                                                  │
client (SSE) ── LB ── API gateway ── auth ────────┴── conversation store (Postgres / KV)
                                                  │
                                                  ▼
                                       request router (model + KV-cache aware)
                                                  │
                                                  ▼
                              ┌─────── GPU inference fleet (sharded by model) ─────────┐
                              │                                                          │
                              │  • prefill pool (large compute, short bursts)            │
                              │  • decode pool (small compute, long-running KV state)    │
                              │  • KV cache memory tier (shared, content-addressed)      │
                              │                                                          │
                              └──────────────────────────────────────────────────────────┘
                                                  │
                                                  ▼
                                       safety post-filter (output classifier)
                                                  │
                                                  ▼ stream back to client

The inference fleet is the soul of the system. Two distinct phases of LLM inference need different scheduling:

Prefill: process the entire input prompt at once (GPU-compute-bound; from tens of ms for short prompts to ~200 ms for a 2K-token prompt on a cache miss, and longer for larger contexts).
Decode: generate one token at a time, attending to the growing KV cache (GPU-memory-bandwidth-bound, ~20-50 ms per token).

Many modern systems split these into different pools.

Step 5 — Data Model

Conversations:

table conversations
  conv_id      uuid     PK
  user_id      uuid
  model        string
  created_at   timestamp
  updated_at   timestamp

table messages
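  conv_id      uuid       PK   (partition key: groups all messages of one conversation)
  msg_id       timeuuid   CK   (clustering key: time-ordered within the conversation)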
  role         enum(user, assistant, system, tool)
  content      text
  tokens_in    int
  tokens_out   int
  cost_cents   decimal

KV cache index (in-memory across the inference fleet):

key: hash(conversation_prefix_tokens) → kv_cache_handle on GPU_node_N

The cache is content-addressed by token-prefix hash. When request R2 starts with the same prefix as R1 (very common in multi-turn chat, where each turn resends the prior messages), R2 reuses R1’s prefix KV and skips prefill for the shared tokens.
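One way to picture the index is as a map from chained block hashes to the node holding that prefix. The sketch below is an illustration under assumptions, not a description of any particular serving stack: the block size, the `PrefixCacheIndex` class, and the node names are all hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK = 4  # tokens per hashed block; tiny for illustration (an assumption)

def block_hashes(tokens: List[int]) -> List[str]:
    """Chained hashes: entry i identifies the whole prefix tokens[:(i + 1) * BLOCK]."""
    hashes: List[str] = []
    running = hashlib.sha256()
    full_blocks_end = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for start in range(0, full_blocks_end, BLOCK):
        running.update(repr(tokens[start:start + BLOCK]).encode())
        hashes.append(running.hexdigest())
    return hashes

class PrefixCacheIndex:
    """Maps prefix hash -> node that holds the KV state for that prefix."""

    def __init__(self) -> None:
        self._index: Dict[str, str] = {}

    def register(self, tokens: List[int], node: str) -> None:
        for h in block_hashes(tokens):
            self._index[h] = node

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[str]]:
        """Return (cached_token_count, node) for the longest cached prefix."""
        best: Tuple[int, Optional[str]] = (0, None)
        for i, h in enumerate(block_hashes(tokens)):
            node = self._index.get(h)
            if node is None:
                break  # hashes are chained, so no longer prefix can match
            best = ((i + 1) * BLOCK, node)
        return best

index = PrefixCacheIndex()
turn_1 = list(range(10))              # R1: ten tokens, two full blocks
index.register(turn_1, "gpu-7")
turn_2 = turn_1 + [99, 100, 101]      # R2: same conversation, one more turn
print(index.longest_prefix(turn_2))   # eight tokens of R2 are already cached on gpu-7
```

Hashing at block granularity rather than per token keeps the index small; the cost is that a partially filled trailing block is never reused.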

Step 6 — Detailed Design

Streaming and SSE

The server sends data: { "delta": "token" } events as fast as decode emits them. The client’s read rate bounds the stream: if the client falls behind, the server either stalls decode or buffers up to a few hundred tokens.

Network frame: each event is ~30 B (token text + JSON wrapping). At 50 tokens/sec, that’s 1.5 KB/sec of egress — trivial.
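The framing is simple enough to sketch end to end. The example below encodes tokens as `data:` events and reassembles them on the client side from arbitrarily split network chunks; the helper names `sse_event` and `parse_sse` are hypothetical, and only the single-line `data:` field of the SSE format is handled.

```python
import json
from typing import Iterable, Iterator

def sse_event(delta: str) -> bytes:
    """Encode one token as a Server-Sent Event: a 'data:' line plus a blank line."""
    payload = json.dumps({"delta": delta}, separators=(",", ":"))
    return f"data: {payload}\n\n".encode("utf-8")

def parse_sse(stream: Iterable[bytes]) -> Iterator[str]:
    """Yield the delta text of each complete event in a chunked byte stream."""
    buffer = b""
    for chunk in stream:
        buffer += chunk
        # An event is complete once its terminating blank line has arrived.
        while b"\n\n" in buffer:
            frame, buffer = buffer.split(b"\n\n", 1)
            for line in frame.decode("utf-8").splitlines():
                if line.startswith("data:"):
                    yield json.loads(line[len("data:"):].strip())["delta"]

wire = b"".join(sse_event(tok) for tok in ["Hello", ",", " world"])
# Deliver in 7-byte chunks to show that events split across reads are reassembled.
chunks = [wire[i:i + 7] for i in range(0, len(wire), 7)]
text = "".join(parse_sse(chunks))

print(len(sse_event("Hello")), "bytes per event")
print(text)
```

With compact JSON a five-character token comes to 25 bytes on the wire, in line with the ~30 B estimate above.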

Prefill / decode separation

Request R arrives:
1. Hit conversation store; assemble messages array; tokenize to [t0..tN].
2. KV-cache lookup: does any GPU node hold a prefix of [t0..tN]?
3. If yes (cache hit for first K tokens):
     route to that GPU node; prefill only [tK..tN].
4. If no:
     prefill on prefill pool; KV state created; pass handle to decode pool.
5. Decode pool generates tokens one at a time, attending to the KV state.
6. As each token emits, stream back via SSE.

Prefill is compute-heavy and short — a single GPU node can process many prefill jobs per second. Decode is memory-bandwidth-heavy and long-running (lasts until generation ends, can be many seconds). Different schedulers fit each.
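Steps 2 to 4 of the flow reduce to a small planning decision. The sketch below assumes the cache lookup has already returned how many prompt tokens are cached and on which node; `Plan`, `plan_request`, and the `"prefill-pool"` label are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Plan:
    path: str                     # "cache_hit" or "cache_miss"
    prefill_on: str               # where the remaining prefill runs
    tokens_to_prefill: List[int]  # suffix that still needs a forward pass

def plan_request(tokens: List[int], cached_len: int, cached_node: Optional[str]) -> Plan:
    # Even with the whole prompt cached, the last prompt token must be run
    # to produce the logits from which the first output token is sampled.
    cached_len = min(cached_len, len(tokens) - 1)
    if cached_node is not None and cached_len > 0:
        # Step 3: route to the node holding the prefix; prefill only [tK..tN].
        return Plan("cache_hit", cached_node, tokens[cached_len:])
    # Step 4: full prefill on the prefill pool, then hand the KV state to decode.
    return Plan("cache_miss", "prefill-pool", tokens)

prompt = list(range(10))
hit = plan_request(prompt, cached_len=8, cached_node="gpu-7")
miss = plan_request(prompt, cached_len=0, cached_node=None)
fully_cached = plan_request(prompt, cached_len=10, cached_node="gpu-7")

print(hit)
print(miss)
print(fully_cached)
```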

Continuous batching

Instead of waiting for fixed-size batches (the old way), decode runs in continuous batching mode:

At each decode step:
  - Take all currently-active requests on this GPU node.
  - Compute the next token for each (in one batched forward pass).
  - Some requests finish; remove them.
  - New requests slot into freed positions immediately.

This keeps the GPU at high utilization without making any request wait for a batch to fill or drain. The first decode pass might batch 32 requests; as some finish, new ones join. The same batch slot can serve dozens of different requests in sequence.
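The loop above can be simulated without a GPU. In this sketch each request is reduced to a count of decode steps it still needs; `Request` and `run_continuous_batching` are hypothetical names, and the forward pass is replaced by decrementing a counter.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List

@dataclass
class Request:
    req_id: str
    tokens_left: int  # decode steps still needed (assumed >= 1)

def run_continuous_batching(pending: Deque[Request], max_batch: int) -> Dict[str, int]:
    """Simulate decode steps; return the step at which each request finished."""
    active: List[Request] = []
    finished_at: Dict[str, int] = {}
    step = 0
    while pending or active:
        # New requests slot into freed positions before the next step runs.
        while pending and len(active) < max_batch:
            active.append(pending.popleft())
        step += 1
        # One batched forward pass: every active request gets its next token.
        for req in active:
            req.tokens_left -= 1
            if req.tokens_left == 0:
                finished_at[req.req_id] = step
        # Finished requests leave the batch immediately.
        active = [r for r in active if r.tokens_left > 0]
    return finished_at

queue = deque([Request("A", 3), Request("B", 1), Request("C", 2)])
result = run_continuous_batching(queue, max_batch=2)
print(result)
```

With two slots, C takes over B's slot after the first step and finishes at step 3. A fixed batch of A and B would have held that slot until A finished, pushing C's completion to step 5.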

Routing and the cache-aware load balancer

The router is not naive round-robin. For each request, it considers:

Which model is requested.
How much of this conversation’s token prefix each node already holds in its KV cache.
Which nodes have GPU memory headroom for the conversation’s projected context length.

Route to the node maximizing (cache_hit + memory_fit). The result is that a conversation sticks to one node, which keeps multi-turn chats cheap.
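A scoring function along these lines might look like the sketch below. The equal weighting of the two terms, the `Node` fields, and the way `memory_fit` is normalized are all assumptions made for illustration; a production router would tune these.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    model: str
    free_kv_tokens: int     # KV capacity still available, in tokens
    cached_prefix_len: int  # tokens of this conversation's prefix cached here

def choose_node(nodes: List[Node], model: str,
                prompt_len: int, projected_len: int) -> Optional[Node]:
    """Pick the node with the highest cache_hit + memory_fit score, or None."""
    best_score, best_node = -1.0, None
    for node in nodes:
        # Cached tokens already occupy memory, so only the remainder must fit.
        needed = projected_len - node.cached_prefix_len
        if node.model != model or node.free_kv_tokens < max(needed, 1):
            continue
        cache_hit = node.cached_prefix_len / prompt_len    # fraction of prefill saved
        memory_fit = 1.0 - needed / node.free_kv_tokens    # headroom left afterwards
        score = cache_hit + memory_fit
        if score > best_score:
            best_score, best_node = score, node
    return best_node

fleet = [
    Node("a", "large", free_kv_tokens=10_000, cached_prefix_len=1_500),
    Node("b", "large", free_kv_tokens=50_000, cached_prefix_len=0),
    Node("c", "small", free_kv_tokens=100_000, cached_prefix_len=0),
    Node("d", "large", free_kv_tokens=100, cached_prefix_len=0),
]
chosen = choose_node(fleet, "large", prompt_len=2_000, projected_len=3_000)
print(chosen.name if chosen else None)
```

Node `a` wins despite having less free memory than `b`, because it already holds 75% of the prompt's prefix; that is the conversation-to-node affinity described above.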

Safety filtering

Sync input filtering: run a small classifier on the prompt before inference and refuse if flagged. This adds latency but blocks the request before any GPU is committed. Good for high-risk categories.

Streaming output filter: classify tokens as they’re emitted; halt generation and emit a refusal mid-stream if a problematic pattern emerges. Lower latency for clean prompts; minor risk of leaking a few bad tokens before the halt.

Most providers use both. Input filtering catches obvious abuse; streaming output filtering catches model-emergent harms.
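The two filters compose naturally around a token generator. The sketch below uses a keyword check as a stand-in for a real classifier; `toy_flag`, `guarded_stream`, the refusal string, and the 64-character window are hypothetical choices, not any provider's actual filter.

```python
from typing import Callable, Iterable, Iterator

REFUSAL = "[response withheld by safety filter]"

def toy_flag(text: str) -> bool:
    """Stand-in classifier (hypothetical): flags text containing a blocked phrase."""
    return "forbidden" in text.lower()

def guarded_stream(prompt: str,
                   generate: Callable[[str], Iterable[str]],
                   flag: Callable[[str], bool] = toy_flag,
                   window: int = 64) -> Iterator[str]:
    # Sync input filter: refuse before any generation work is started.
    if flag(prompt):
        yield REFUSAL
        return
    emitted = ""
    for token in generate(prompt):
        # Streaming output filter: classify recent output plus the candidate token.
        if flag((emitted + token)[-window:]):
            yield REFUSAL  # halt mid-stream; earlier tokens have already been sent
            return
        emitted += token
        yield token

def fake_model(prompt: str) -> Iterable[str]:
    return iter(["This ", "is ", "for", "bidden ", "text"])

streamed = list(guarded_stream("tell me something", fake_model))
print(streamed)
```

The fragment "for" reaches the client before the pattern completes, which is the small leak the streaming approach accepts in exchange for lower latency.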

Multi-tenant scheduling and rate limiting

Each user / API key has:

A QPS limit.
A tokens-per-minute limit.
A priority class (free / pro / enterprise).

The request router uses these to enforce a weighted fair share across the GPU fleet. A free user’s request waits in queue if the fleet is saturated; a paying user’s request gets routed to a less-loaded shard or jumps the queue.
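The per-key limits and the priority classes can be sketched with a token bucket and a heap. Note the simplification: the queue below uses strict priority rather than a true weighted fair share, so it could starve the free tier under sustained load. `TokenBucket`, `AdmissionQueue`, and the tier ordering are hypothetical names for illustration, and time is passed in explicitly to keep the example deterministic.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

PRIORITY = {"enterprise": 0, "pro": 1, "free": 2}  # lower value is served first

class TokenBucket:
    """Tokens-per-minute limit that refills continuously up to its capacity."""

    def __init__(self, tokens_per_minute: int, now: float = 0.0) -> None:
        self.capacity = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0  # refill rate in tokens per second
        self.level = self.capacity
        self.last = now

    def try_consume(self, tokens: int, now: float) -> bool:
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        if tokens > self.level:
            return False
        self.level -= tokens
        return True

class AdmissionQueue:
    """Strict-priority queue with FIFO order inside each priority class."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # arrival counter used as a tie-break

    def push(self, tier: str, request_id: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[tier], next(self._seq), request_id))

    def pop(self) -> Optional[str]:
        return heapq.heappop(self._heap)[2] if self._heap else None

bucket = TokenBucket(tokens_per_minute=600)
decisions = [bucket.try_consume(500, now=0.0),   # fits in the initial 600
             bucket.try_consume(200, now=0.0),   # only 100 left: rejected
             bucket.try_consume(200, now=10.0)]  # 10 s later, 100 more refilled

queue = AdmissionQueue()
for tier, rid in [("free", "f1"), ("pro", "p1"), ("free", "f2"), ("enterprise", "e1")]:
    queue.push(tier, rid)
order = [queue.pop() for _ in range(4)]
print(decisions, order)
```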

Cost accounting

Per-request cost is approximately prefill_tokens × input_rate + output_tokens × output_rate. It is stored in the message row (cost_cents) at request completion and aggregated for billing.
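As a worked instance of the formula, the sketch below uses `Decimal` to avoid floating-point drift in billing sums. The two rates are made-up placeholders, not any provider's actual prices, and `request_cost_cents` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical prices in cents per 1,000 tokens; not any provider's real rates.
INPUT_RATE = Decimal("0.25")
OUTPUT_RATE = Decimal("1.00")

def request_cost_cents(prefill_tokens: int, output_tokens: int) -> Decimal:
    """prefill_tokens x input_rate + output_tokens x output_rate, in cents."""
    cost = (Decimal(prefill_tokens) * INPUT_RATE
            + Decimal(output_tokens) * OUTPUT_RATE) / 1000
    # Keep four decimal places so small requests do not round to zero.
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)

cost = request_cost_cents(prefill_tokens=2_000, output_tokens=100)
print(cost)
```

Storing sub-cent precision per message and rounding only at invoice time avoids systematically under- or over-billing high-volume keys.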

Latency budget (target first-token < 1 s p95)

LB + TLS:                       30 ms
Auth + rate limit:               5 ms
Conversation history fetch:      10 ms
Tokenize:                        5 ms
Cache lookup + route:            5 ms
Prefill (cache miss, 2K tokens): 200 ms
First decode step:               40 ms
Token serialized to SSE:         5 ms
Network back:                    30 ms
                       total:   ~330 ms p50, ~800 ms p95

A cache hit collapses prefill to near-zero, bringing p50 first-token latency down to ~130 ms: the difference between “instant” and “thinking”.
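Summing the budget makes the effect of a cache hit concrete. The dictionary below just restates the line items above with illustrative key names; treating a cache hit as zero prefill time is an idealization.

```python
# First-token latency budget in milliseconds (p50 line items from the table above).
BUDGET_MS = {
    "lb_tls": 30,
    "auth_rate_limit": 5,
    "history_fetch": 10,
    "tokenize": 5,
    "cache_lookup_route": 5,
    "prefill_2k_miss": 200,
    "first_decode_step": 40,
    "sse_serialize": 5,
    "network_back": 30,
}

def first_token_ms(budget: dict, cache_hit: bool = False) -> int:
    """Sum the budget; on a cache hit the prefill term is idealized to zero."""
    total = sum(budget.values())
    if cache_hit:
        total -= budget["prefill_2k_miss"]
    return total

miss_ms = first_token_ms(BUDGET_MS)
hit_ms = first_token_ms(BUDGET_MS, cache_hit=True)
print(f"cache miss: {miss_ms} ms, cache hit: {hit_ms} ms")
print(f"prefill share of the miss path: {BUDGET_MS['prefill_2k_miss'] / miss_ms:.0%}")
```

Prefill is about 61% of the miss path, so it is the only line item where optimization changes the user-visible number much.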

Long-context handling

A 200K-token context needs 16+ GB of KV cache. Strategies:

Page the KV cache to host RAM when not active; bring it back on resume. Costs latency (~100 ms per GB).
Shrink the cache with techniques like attention-sink eviction (keep the first few tokens plus a recent window) or KV quantization.
Truncate older context that hasn’t been referenced recently (as a policy the user is told about, not a silent one).

A frontier model serving long contexts at scale is dominated by these memory-management trade-offs.
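The trade-offs can be put in numbers using the figures already given: ~80 KB of KV cache per token and ~100 ms per GB to page back in. The sketch assumes 8-bit quantization exactly halves the 16-bit footprint, which ignores the small overhead of scale factors.

```python
KV_BYTES_PER_TOKEN = 80_000  # ~80 KB per token at 16-bit precision (Step 2)
PAGE_IN_MS_PER_GB = 100      # host RAM -> GPU transfer cost quoted above

def kv_cache_gb(context_tokens: int, bytes_per_token: int = KV_BYTES_PER_TOKEN) -> float:
    """KV cache size in GB for one request with the given context length."""
    return context_tokens * bytes_per_token / 1e9

def page_in_ms(context_tokens: int, bytes_per_token: int = KV_BYTES_PER_TOKEN) -> float:
    """Latency to bring a paged-out KV cache back onto the GPU."""
    return kv_cache_gb(context_tokens, bytes_per_token) * PAGE_IN_MS_PER_GB

full_gb = kv_cache_gb(200_000)
resume_ms = page_in_ms(200_000)
# Assumption: 8-bit KV quantization halves the 16-bit size (metadata ignored).
quantized_gb = kv_cache_gb(200_000, KV_BYTES_PER_TOKEN // 2)
quantized_resume_ms = page_in_ms(200_000, KV_BYTES_PER_TOKEN // 2)

print(f"200K context: {full_gb:.0f} GB, resume after paging: {resume_ms:.0f} ms")
print(f"8-bit KV:     {quantized_gb:.0f} GB, resume after paging: {quantized_resume_ms:.0f} ms")
```

At these rates, resuming a fully paged-out 200K-token conversation takes about 1.6 s, which alone exceeds the 1 s first-token target; quantizing first brings the resume cost under it.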

Step 7 — Evaluation & Trade-offs

Bottleneck #1: GPU memory. Each request consumes KV memory proportional to its context length. The number of concurrent requests per GPU is roughly (gpu_memory - model_weights) / per_request_kv. For a 200K-token context (16+ GB each), this is single-digit concurrency per GPU, which limits throughput. Mitigation: paged KV (vLLM-style) and aggressive KV quantization.

Bottleneck #2: tail latency from preemption. If a GPU is busy with a long-running decode and a high-priority request lands, preempting is costly (state save/restore). Most systems use per-priority pools — premium requests have reserved capacity, eliminating the preemption case.

Bottleneck #3: cache eviction churn. A KV cache holding millions of partial conversations needs eviction. Naive LRU serves this poorly: a long tail of one-off conversations keeps pushing out the prefixes of users who are still mid-conversation. A hybrid eviction (recency + size + per-user fairness) is the production shape.

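Alternative I’d push back on: synchronous, non-streaming responses for chat. The user perceives latency very differently when they see tokens appearing vs waiting for the full response. Streaming is not just a UX preference: with the budget above, the first token arrives in ~330 ms, while a 100-token answer at ~40 ms per token takes over 4 s to complete, so the perceived-latency improvement is well over 5×.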

What breaks first at 10× scale: GPU fleet capex. Inference cost is the dominant constraint for LLM providers. Solutions are mostly model-side (smaller / sparser / quantized models) rather than systems-side; the systems-side levers are utilization (continuous batching), cache reuse, and routing efficiency.

Companies this resembles

OpenAI ChatGPT, Anthropic Claude, Google Gemini, Meta AI, Mistral platforms, xAI Grok, and the open-source serving stack (vLLM, TGI, TensorRT-LLM).

AI / ML Data Infrastructure — feeds the offline training pipelines this inference fleet serves.
LLM-Powered Customer Support Bot — a specific application on top.
AI-Powered Code Assistant — same inference primitives, latency-sensitive variant.
Distributed Task Scheduler — the substrate for non-realtime background generation.