ChatGPT-style Conversational System
Streaming inference, KV-cache reuse, request routing, safety filters, multi-tenant GPU scheduling.
Step 1 — Clarify Requirements
Functional
- A user sends a prompt; the system streams tokens back as they’re generated.
- Multi-turn conversations: each turn’s context includes the prior messages.
- Multiple model sizes (small / large / frontier) selectable per request.
- Safety filters on both input and output.
- Out of scope here: model training; fine-tuning infrastructure; multimodal (image/audio) input.
Non-functional
- 99.9% availability.
- First-token latency (time-to-first-token) under 1 second p95.
- Inter-token latency under 50 ms (so generation feels fluid at 20+ tokens/sec).
- Throughput: hundreds of millions of tokens generated per minute at peak across the fleet.
- Multi-tenant: a free-tier user shouldn’t block a paying user’s request.
Step 2 — Capacity Estimation
- MAU: ~200 M (ChatGPT-scale figures publicly disclosed).
- Concurrent requests at peak: ~50 K active conversations generating at once.
- Tokens generated/sec: 50 K concurrent requests × 100 tokens avg ÷ ~30 s generation time ≈ 167 K tokens/sec, rounded here to ~150 K tokens/sec aggregate.
- GPU fleet: at ~50 tokens/sec per frontier-model instance, 150 K tokens/sec ÷ 50 = ~3,000 instances. That puts the frontier-model serving fleet alone at roughly $90M/year, which explains why providers price hard against inference cost.
- Context length per request: typically 2-8K tokens, up to ~200K. Each token in the KV cache takes ~2 (K and V) × hidden_dim × num_layers × 2 bytes (fp16), taken here as ~80 KB at frontier-model dimensions. With grouped-query attention the effective width is num_kv_heads × head_dim rather than the full hidden_dim, which is what keeps the figure in this range. A 100K-token context: ~8 GB of KV cache per active request.
- Storage (conversation history): ~1 KB per message × 100 messages/user × 200 M users = 20 TB (modest).
The dominant resource is GPU memory and compute. Storage and CPU are afterthoughts.
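The arithmetic above can be reproduced with a short script. This is a back-of-envelope sketch: every constant is an assumed figure taken from the bullets, the function names are illustrative, and the layer and head counts in the last line are a hypothetical shape chosen only because it lands on 80 KB per token.

```python
import math

# Assumed inputs, copied from the estimates in the text.
PEAK_CONCURRENT_REQUESTS = 50_000
AVG_OUTPUT_TOKENS = 100
GENERATION_SECONDS = 30
TOKENS_PER_SEC_PER_INSTANCE = 50
KV_BYTES_PER_TOKEN = 80 * 1024  # ~80 KB per cached token


def aggregate_tokens_per_sec() -> float:
    """Fleet-wide decode rate implied by the concurrency and response size."""
    return PEAK_CONCURRENT_REQUESTS * AVG_OUTPUT_TOKENS / GENERATION_SECONDS


def instances_needed(tokens_per_sec: float) -> int:
    """Instances required if each one sustains the assumed decode rate."""
    return math.ceil(tokens_per_sec / TOKENS_PER_SEC_PER_INSTANCE)


def kv_cache_gb(context_tokens: int) -> float:
    """KV cache size in decimal GB for one request of the given context length."""
    return context_tokens * KV_BYTES_PER_TOKEN / 1e9


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """2 (K and V) x layers x KV width x bytes per value."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


print(round(aggregate_tokens_per_sec()))   # about 167 K before rounding down
print(instances_needed(150_000))           # 3000
print(round(kv_cache_gb(100_000), 1))      # about 8.2 GB
print(kv_bytes_per_token(40, 4, 128))      # hypothetical shape giving 81920 bytes
```

Changing any one constant shows how sensitive the fleet size is to the per-instance decode rate, which is the least certain of the inputs.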
Step 3 — System Interface
POST /v1/chat/completions
  Body: {
    model: "gpt-x-large" | ...,
    messages: [{role, content}, ...],
    max_tokens: int,
    stream: bool,
    temperature: float,
    ...
  }
  If stream=false: synchronous response with full output.
  If stream=true: Server-Sent Events with one token / chunk per event.

POST /v1/conversations   (server-side state, returns conversation_id)
POST /v1/conversations/:id/messages
GET  /v1/models

The streaming protocol is the user experience: first-token latency is more important than total latency because users start reading the moment text appears.
Step 4 — High-Level Design
                    safety pre-filter (input toxicity, prompt-injection signals)
                                                  │
client (SSE) ── LB ── API gateway ── auth ────────┴── conversation store (Postgres / KV)
                                                  │
                                                  ▼
                               request router (model + KV-cache aware)
                                                  │
                                                  ▼
                     ┌─────── GPU inference fleet (sharded by model) ─────────┐
                     │                                                        │
                     │  • prefill pool (large compute, short bursts)          │
                     │  • decode pool (small compute, long-running KV state)  │
                     │  • KV cache memory tier (shared, content-addressed)    │
                     │                                                        │
                     └────────────────────────────────────────────────────────┘
                                                  │
                                                  ▼
                               safety post-filter (output classifier)
                                                  │
                                                  ▼
                                        stream back to client

The inference fleet is the soul of the system. Two distinct phases of LLM inference need different scheduling:
- Prefill: process the entire input prompt at once (GPU-compute-bound; tens of ms for short prompts, rising to ~200 ms for a 2K-token prompt on a cache miss, as in the latency budget below).
- Decode: generate one token at a time, attending to the growing KV cache (GPU-memory-bandwidth-bound, ~20-50 ms per token).
Many modern systems split these into different pools.
Step 5 — Data Model
Conversations:
table conversations
  conv_id     uuid       PK
  user_id     uuid
  model       string
  created_at  timestamp
  updated_at  timestamp

table messages
  conv_id     uuid       PK
  msg_id      timeuuid   CK
  role        enum(user, assistant, system, tool)
  content     text
  tokens_in   int
  tokens_out  int
  cost_cents  decimal

KV cache index (in-memory across the inference fleet):
  key: hash(conversation_prefix_tokens) → kv_cache_handle on GPU_node_N

The cache is content-addressed by token-prefix hash. When request R2 starts with the same prefix as R1 (very common in multi-turn chat), R2 reuses R1’s prefix KV, saving the prefill cost.
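One way to picture the content-addressed index is a map from chained block hashes to node ids, so that a lookup returns the longest cached prefix. The sketch below is illustrative only: the block size, the `PrefixIndex` class, and the node names are assumptions, and a real serving stack would store a KV handle rather than a bare node id.

```python
import hashlib
from typing import Optional

BLOCK = 16  # tokens per hashed block (assumed granularity)


def block_hashes(tokens: list[int]) -> list[str]:
    """Chained hashes: entry i identifies tokens[0 : (i + 1) * BLOCK]."""
    hashes = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK  # ignore the trailing partial block
    for start in range(0, full, BLOCK):
        running.update(repr(tokens[start:start + BLOCK]).encode())
        hashes.append(running.copy().hexdigest())
    return hashes


class PrefixIndex:
    """Hypothetical in-memory index: prefix hash -> node holding that KV prefix."""

    def __init__(self) -> None:
        self._index: dict[str, str] = {}

    def register(self, tokens: list[int], node: str) -> None:
        for h in block_hashes(tokens):
            self._index[h] = node

    def longest_prefix(self, tokens: list[int]) -> tuple[Optional[str], int]:
        """Return (node, number of prompt tokens already cached on that node)."""
        best: tuple[Optional[str], int] = (None, 0)
        for i, h in enumerate(block_hashes(tokens)):
            node = self._index.get(h)
            if node is None:
                break
            best = (node, (i + 1) * BLOCK)
        return best


index = PrefixIndex()
turn_1 = list(range(40))             # stand-in token ids for the first turn
index.register(turn_1, "gpu-7")
turn_2 = turn_1 + [999] * 10         # second turn extends the same prefix
print(index.longest_prefix(turn_2))  # ('gpu-7', 32): only tokens 32.. need prefill
```

Hashing per block rather than per token keeps the index small at the cost of recomputing up to one block of prefill on every hit.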
Step 6 — Detailed Design
Streaming and SSE
The server sends data: { "delta": "token" } events as fast as decode emits them. Backpressure comes from the client’s read rate: if the client falls behind, the server either buffers a few hundred tokens or stalls decode for that request.
Network frame: each event is ~30 B (token text + JSON wrapping). At 50 tokens/sec, that’s 1.5 KB/sec of egress — trivial.
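As a rough illustration of the wire format, the sketch below frames one delta as a Server-Sent Event and parses a stream back into text. The helper names are hypothetical; the only parts taken from the text are the `data:` prefix and the `delta` field, and the blank line terminating each event follows the standard SSE framing.

```python
import json


def sse_event(delta: str) -> bytes:
    """Frame one token delta as an SSE event (data line plus blank line)."""
    payload = json.dumps({"delta": delta}, separators=(",", ":"))
    return f"data: {payload}\n\n".encode("utf-8")


def parse_sse(stream: bytes) -> list[str]:
    """Recover the deltas from a concatenation of events produced above."""
    deltas = []
    for frame in stream.decode("utf-8").split("\n\n"):
        if frame.startswith("data: "):
            deltas.append(json.loads(frame[len("data: "):])["delta"])
    return deltas


event = sse_event("Hello")
print(event)                # b'data: {"delta":"Hello"}\n\n'
print(len(event) * 50)      # bytes per second at 50 tokens/sec for frames this size
print("".join(parse_sse(sse_event("Hel") + sse_event("lo"))))
```

JSON-encoding the delta keeps newlines inside a token escaped, so a single event never spans more than one `data:` line.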
Prefill / decode separation
Request R arrives:
1. Hit conversation store; assemble messages array; tokenize to [t0..tN].
2. KV-cache lookup: does any GPU node hold a prefix of [t0..tN]?
3. If yes (cache hit for first K tokens): route to that GPU node; prefill only [tK..tN].
4. If no: prefill on prefill pool; KV state created; pass handle to decode pool.
5. Decode pool generates tokens one at a time, attending to the KV state.
6. As each token emits, stream back via SSE.

Prefill is compute-heavy and short: a single GPU node can process many prefill jobs per second. Decode is memory-bandwidth-heavy and long-running (it lasts until generation ends, which can be many seconds). Different schedulers fit each.
Continuous batching
Instead of waiting for fixed-size batches (the old way), decode runs in continuous batching mode:
At each decode step:
- Take all currently-active requests on this GPU node.
- Compute the next token for each (in one batched forward pass).
- Some requests finish; remove them.
- New requests slot into freed positions immediately.

This keeps the GPU at high utilization without making individual requests wait for a batch to fill. The first decode pass might batch 32 requests; by the time some finish, new ones have joined. The same batch slot can serve dozens of different requests sequentially.
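The loop above can be simulated without a GPU. The following toy scheduler is a sketch under simplifying assumptions: every decode step costs the same, each request needs at least one token, and `Request` and `run_continuous_batching` are illustrative names rather than any serving framework's API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    remaining: int  # tokens still to generate (assumed >= 1)


def run_continuous_batching(pending: list[Request], max_batch: int) -> dict[str, int]:
    """Simulate decode steps; return the step at which each request finished."""
    queue = deque(pending)
    active: list[Request] = []
    finished: dict[str, int] = {}
    step = 0
    while queue or active:
        # New requests slot into freed positions before the next step.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        step += 1
        # One batched forward pass: every active request emits one token.
        for req in active:
            req.remaining -= 1
            if req.remaining == 0:
                finished[req.req_id] = step
        # Finished requests leave the batch immediately.
        active = [req for req in active if req.remaining > 0]
    return finished


result = run_continuous_batching(
    [Request("a", 3), Request("b", 1), Request("c", 2)], max_batch=2
)
print(result)  # {'b': 1, 'a': 3, 'c': 3}
```

Request `c` takes over the slot freed by `b` at step 2 and finishes at step 3. With a fixed batch of two it would not start until `a` finished, so it would complete at step 5.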
Routing and the cache-aware load balancer
The router is not naive round-robin. For each request, it considers:
- Which model is requested?
- What KV-cache prefix-hit ratio does each node offer for this conversation?
- Which nodes have GPU memory headroom for the conversation’s projected context length?
Route to the node maximizing (cache_hit + memory_fit). The result is affinity for a conversation to a node — keeps multi-turn chats cheap.
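A minimal sketch of that scoring rule follows. The text only names the two terms, so their definitions here are assumptions: `cache_hit` is the fraction of the prompt already cached on the node, and `memory_fit` is the fraction of the node's free KV memory that would remain after placement. `Node`, `score`, and `route` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: str
    model: str
    free_kv_bytes: int
    cached_prefix_tokens: int  # tokens of this conversation's prefix cached here


def score(node: Node, prompt_tokens: int, projected_kv_bytes: int) -> float:
    """cache_hit + memory_fit, each normalised to the range [0, 1]."""
    cache_hit = min(node.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    memory_fit = 1.0 - projected_kv_bytes / max(node.free_kv_bytes, 1)
    return cache_hit + memory_fit


def route(nodes: list[Node], model: str, prompt_tokens: int,
          projected_kv_bytes: int) -> Optional[Node]:
    """Pick the best node serving the model that can hold the projected KV state."""
    eligible = [
        n for n in nodes
        if n.model == model and n.free_kv_bytes >= projected_kv_bytes
    ]
    if not eligible:
        return None  # caller queues the request or sheds load
    return max(eligible, key=lambda n: score(n, prompt_tokens, projected_kv_bytes))


fleet = [
    Node("a", "large", free_kv_bytes=40, cached_prefix_tokens=0),
    Node("b", "large", free_kv_bytes=20, cached_prefix_tokens=1800),
    Node("c", "small", free_kv_bytes=80, cached_prefix_tokens=2000),
]
print(route(fleet, "large", prompt_tokens=2000, projected_kv_bytes=10).node_id)  # b
```

Node `b` wins despite having less headroom because 90% of the prompt is already cached there. Once the projected KV state no longer fits on `b`, the request falls back to `a` and pays full prefill.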
Safety filtering
Most providers filter both the input and the output. Input pre-filtering catches obvious abuse before any GPU time is spent; streaming output filtering catches model-emergent harms.
Multi-tenant scheduling and rate limiting
Each user / API key has:
- A QPS limit.
- A tokens-per-minute limit.
- A priority class (free / pro / enterprise).
The request router uses these to enforce a weighted fair share across the GPU fleet. A free user’s request waits in queue if the fleet is saturated; a paying user’s request gets routed to a less-loaded shard or jumps the queue.
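One common way to realise a weighted fair share is a self-clocked fair queue, where each request gets a virtual finish time of its cost divided by its tier's weight. The sketch below is a simplified illustration, not a description of any provider's scheduler; the tier weights and the `WeightedFairQueue` class are assumptions.

```python
import heapq
from itertools import count

# Assumed weights: a pro request is charged a quarter of the virtual time of a free one.
TIER_WEIGHTS = {"free": 1, "pro": 4, "enterprise": 16}


class WeightedFairQueue:
    """Simplified self-clocked weighted fair queue keyed by priority class."""

    def __init__(self, weights: dict[str, int]) -> None:
        self._weights = weights
        self._last_finish = {tier: 0.0 for tier in weights}
        self._heap: list[tuple[float, int, str]] = []
        self._seq = count()       # tie-breaker keeps FIFO order within equal tags
        self._virtual_time = 0.0

    def enqueue(self, tier: str, request_id: str, cost_tokens: int) -> None:
        start = max(self._virtual_time, self._last_finish[tier])
        finish = start + cost_tokens / self._weights[tier]
        self._last_finish[tier] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), request_id))

    def dequeue(self) -> str:
        finish, _, request_id = heapq.heappop(self._heap)
        self._virtual_time = max(self._virtual_time, finish)
        return request_id


queue = WeightedFairQueue(TIER_WEIGHTS)
for name in ("f1", "f2", "f3"):
    queue.enqueue("free", name, cost_tokens=100)
queue.enqueue("pro", "p1", cost_tokens=100)
print([queue.dequeue() for _ in range(4)])  # ['p1', 'f1', 'f2', 'f3']
```

The paying request jumps ahead of the queued free-tier requests, but free traffic is never starved: each pro request advances that tier's own finish time, so a flood of pro traffic eventually interleaves with free traffic in proportion to the weights.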
Cost accounting
Per-request cost is approximately prefill_tokens × input_rate + output_tokens × output_rate. Stored in the message row at request completion. Aggregated for billing.
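The formula translates directly into code. In this sketch the rates are made-up placeholders expressed in cents per 1,000 tokens, and `RATES_CENTS_PER_1K` and `request_cost_cents` are hypothetical names. `Decimal` is used so the value written to `cost_cents` does not pick up binary floating-point error.

```python
from decimal import Decimal

# Placeholder rates (input, output) in cents per 1,000 tokens; not real prices.
RATES_CENTS_PER_1K = {
    "gpt-x-large": (Decimal("0.30"), Decimal("1.20")),
}


def request_cost_cents(model: str, prefill_tokens: int, output_tokens: int) -> Decimal:
    """prefill_tokens x input_rate + output_tokens x output_rate, in cents."""
    input_rate, output_rate = RATES_CENTS_PER_1K[model]
    cost = (prefill_tokens * input_rate + output_tokens * output_rate) / 1000
    return cost.quantize(Decimal("0.0001"))


print(request_cost_cents("gpt-x-large", prefill_tokens=2000, output_tokens=500))  # 1.2000
```

Whether cached prefix tokens are billed at the full input rate is a pricing decision the formula leaves open; passing only the tokens actually prefilled would pass the cache saving on to the user.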
Latency budget (target first-token < 1 s p95)
  LB + TLS:                            30 ms
  Auth + rate limit:                    5 ms
  Conversation history fetch:          10 ms
  Tokenize:                             5 ms
  Cache lookup + route:                 5 ms
  Prefill (cache miss, 2K tokens):    200 ms
  First decode step:                   40 ms
  Token serialized to SSE:              5 ms
  Network back:                        30 ms
  total:                             ~330 ms p50, ~800 ms p95

A cache hit collapses prefill to near-zero, bringing p50 first-token down to roughly 130 ms: the difference between “instant” and “thinking”.
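Summing the stages makes it easy to see which one dominates. The sketch below uses the p50 stage figures from the budget; the dictionary keys and the `first_token_ms` helper are illustrative, and a full cache hit is modelled as prefill costing zero.

```python
# p50 stage latencies in milliseconds, taken from the budget in the text.
BUDGET_MS = {
    "lb_tls": 30,
    "auth_rate_limit": 5,
    "history_fetch": 10,
    "tokenize": 5,
    "cache_lookup_route": 5,
    "prefill": 200,
    "first_decode_step": 40,
    "sse_serialize": 5,
    "network_back": 30,
}


def first_token_ms(budget: dict[str, int], cache_hit: bool = False) -> int:
    """Total time to first token; a full cache hit is modelled as zero prefill."""
    total = sum(budget.values())
    if cache_hit:
        total -= budget["prefill"]
    return total


print(first_token_ms(BUDGET_MS))                  # 330
print(first_token_ms(BUDGET_MS, cache_hit=True))  # 130
print(max(BUDGET_MS, key=BUDGET_MS.get))          # prefill
```

Prefill accounts for about 60% of the cache-miss path, which is why prefix reuse has more effect on first-token latency than any gateway or network tuning.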
Long-context handling
A 200K-token context needs 16+ GB of KV cache. Strategies:
- Page the KV cache to host RAM when not active; bring back on resume. Costs latency (~100 ms per GB).
- Compress the cache with techniques like attention-sink or KV quantization.
- Truncate older context that hasn’t been referenced recently (user-aware policy, not invisible).
A frontier model serving long contexts at scale is dominated by these memory-management trade-offs.
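To put numbers on the first two strategies, the sketch below sizes a long-context KV cache at several precisions and estimates the page-in delay. It assumes the ~80 KB per token fp16 figure and the ~100 ms per GB paging cost quoted in this walkthrough, and it treats quantization as scaling size linearly with bits per value, ignoring scale and zero-point overhead.

```python
KV_BYTES_PER_TOKEN_FP16 = 80 * 1024  # assumed per-token KV size at 16 bits per value
PAGE_IN_MS_PER_GB = 100              # assumed host-RAM-to-GPU paging cost


def kv_gb(context_tokens: int, bits_per_value: int = 16) -> float:
    """KV cache size in decimal GB at the given precision."""
    scale = bits_per_value / 16
    return context_tokens * KV_BYTES_PER_TOKEN_FP16 * scale / 1e9


def page_in_ms(size_gb: float) -> float:
    """Estimated time to bring a paged-out cache back onto the GPU."""
    return size_gb * PAGE_IN_MS_PER_GB


for bits in (16, 8, 4):
    size = kv_gb(200_000, bits)
    print(f"{bits:>2}-bit KV: {size:6.2f} GB, page-in ~{page_in_ms(size):6.0f} ms")
```

Under these assumptions, resuming a paged-out 200K-token conversation at fp16 costs more than the entire first-token budget, so paging only pays off when combined with quantization or when the resume latency is hidden behind other work.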
Step 7 — Evaluation & Trade-offs
Bottleneck #1: GPU memory. Each request consumes KV memory proportional to its context length. The number of concurrent requests per GPU is gpu_memory / per_request_kv. For a 200K-token context, this is single-digit concurrency per GPU — limiting throughput. Mitigation: paged KV (vLLM-style) and aggressive KV quantization.
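The concurrency bound can be computed directly. This sketch assumes the ~80 KB per token figure used earlier and adds one refinement the formula glosses over: model weights also occupy GPU memory, so only the remainder is available for KV state. The 80 GB device size and 40 GB weight footprint are hypothetical inputs.

```python
def max_concurrent_requests(gpu_memory_gb: float, weights_gb: float,
                            context_tokens: int,
                            kv_bytes_per_token: int = 80 * 1024) -> int:
    """Requests that fit at once: (gpu_memory - weights) / per_request_kv."""
    kv_budget_bytes = (gpu_memory_gb - weights_gb) * 1e9
    per_request_bytes = context_tokens * kv_bytes_per_token
    return int(kv_budget_bytes // per_request_bytes)


print(max_concurrent_requests(80, 0, 200_000))   # 4
print(max_concurrent_requests(80, 40, 200_000))  # 2
print(max_concurrent_requests(80, 0, 8_000))     # 122
```

The jump from 4 to 122 between a 200K-token and an 8K-token context is the reason long-context requests are usually scheduled and priced separately from typical chat traffic.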
Bottleneck #2: tail latency from preemption. If a GPU is busy with a long-running decode and a high-priority request lands, preempting is costly (state save/restore). Most systems use per-priority pools — premium requests have reserved capacity, eliminating the preemption case.
Bottleneck #3: cache eviction churn. A KV cache holding millions of partial conversations needs eviction. Naive LRU misses the recently-active conversations in a long-tail of users. A hybrid eviction (recency + size + per-user fairness) is the production shape.
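A hybrid policy can be expressed as a weighted score over the three signals named above. The sketch below is one possible formulation, not a known production policy: the normalisation, the equal default weights, and the `CacheEntry` and `eviction_order` names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    key: str
    user_id: str
    size_bytes: int
    last_used: float  # timestamp in seconds


def eviction_order(entries: list[CacheEntry], now: float, w_age: float = 1.0,
                   w_size: float = 1.0, w_share: float = 1.0) -> list[CacheEntry]:
    """Sort entries from first-to-evict to last-to-evict."""
    total = sum(e.size_bytes for e in entries) or 1
    per_user: dict[str, int] = {}
    for e in entries:
        per_user[e.user_id] = per_user.get(e.user_id, 0) + e.size_bytes
    max_age = max((now - e.last_used for e in entries), default=0) or 1

    def score(e: CacheEntry) -> float:
        age = (now - e.last_used) / max_age   # recency: older is more evictable
        size = e.size_bytes / total           # size: bigger frees more memory
        share = per_user[e.user_id] / total   # fairness: heavy users pay first
        return w_age * age + w_size * size + w_share * share

    return sorted(entries, key=score, reverse=True)


entries = [
    CacheEntry("a", "u1", 100, last_used=0),
    CacheEntry("b", "u1", 100, last_used=90),
    CacheEntry("c", "u2", 100, last_used=80),
]
print([e.key for e in eviction_order(entries, now=100)])  # ['a', 'b', 'c']
```

Pure LRU would evict `c` before `b`. The fairness term reverses that, because user `u1` already holds two thirds of the cache while `u2` holds a single entry.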
Alternative I’d push back on: synchronous, non-streaming responses for chat. The user perceives latency very differently when they see tokens appearing vs waiting for the full response. Streaming is not just a UX preference — it’s a 5×+ perceived-latency improvement.
What breaks first at 10× scale: GPU fleet capex. Inference cost is the bottleneck of every LLM provider, full stop. Solutions are mostly model-side (smaller / sparser / quantized models) rather than systems-side; the systems lever pulls are about utilization (continuous batching), cache reuse, and routing efficiency.
Companies this resembles
OpenAI ChatGPT, Anthropic Claude, Google Gemini, Meta AI, Mistral platforms, xAI Grok, and the open-source serving stack (vLLM, TGI, TensorRT-LLM).
Related systems
- AI / ML Data Infrastructure — feeds the offline training pipelines this inference fleet serves.
- LLM-Powered Customer Support Bot — a specific application on top.
- AI-Powered Code Assistant — same inference primitives, latency-sensitive variant.
- Distributed Task Scheduler — the substrate for non-realtime background generation.