ChatGPT-style Conversational System

Streaming inference, KV-cache reuse, request routing, safety filters, multi-tenant GPU scheduling.

System Advanced
8 min read
llm inference gpu streaming
Companies this resembles: OpenAI ChatGPT · Anthropic Claude · Google Gemini · Meta AI

Step 1 — Clarify Requirements#

Functional

  • A user sends a prompt; the system streams tokens back as they’re generated.
  • Multi-turn conversations: each turn’s context includes the prior messages.
  • Multiple model sizes (small / large / frontier) selectable per request.
  • Safety filters on both input and output.
  • Out of scope here: model training; fine-tuning infrastructure; multimodal (image/audio) input.

Non-functional

  • 99.9% availability.
  • First-token latency (time-to-first-token) under 1 second p95.
  • Inter-token latency under 50 ms (so generation feels fluid at 20+ tokens/sec).
  • Throughput: hundreds of millions of tokens generated per minute at peak across the fleet.
  • Multi-tenant: a free-tier user shouldn’t block a paying user’s request.

Step 2 — Capacity Estimation#

  • MAU: ~200 M (ChatGPT-scale figures publicly disclosed).
  • Concurrent requests at peak: ~50 K active conversations.
  • Tokens generated/sec: 50 K requests × 100 tokens avg ÷ ~30 s generation time ≈ 167 K tokens/sec, rounded down here to ~150 K tokens/sec aggregate.
  • GPU fleet: a frontier model generating ~50 tokens/sec/instance against a need for 150 K/sec aggregate = ~3,000 instances. At ~$30K/year per H100 (capex amortized), that’s ~$90M/year just for the frontier-model serving fleet, which explains why providers price hard against inference cost.
  • Context length per request: typical 2-8K tokens, up to ~200K. Each token in KV cache = ~2 (K and V) × hidden_dim × num_layers × 2 bytes (16-bit values) ≈ 80 KB at frontier-model dimensions. A 100K-token context: ~8 GB of KV cache per active request.
  • Storage (conversation history): ~1 KB per message × 100 messages/user × 200 M users = 20 TB (modest).

The dominant resource is GPU memory and compute. Storage and CPU are afterthoughts.
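
The estimates above are easy to re-derive, which helps when one input changes. The sketch below recomputes them in Python. All constants are the figures quoted in the list, except the KV dimensions (`kv_width=256`, `num_layers=80`), which are made-up values chosen only so the per-token size lands near the quoted ~80 KB. The fleet cost also assumes one H100 per instance, which is what the arithmetic in the list implies.

```python
import math

# Figures quoted in the estimate above.
PEAK_CONCURRENT_REQUESTS = 50_000
AVG_OUTPUT_TOKENS = 100
GENERATION_SECONDS = 30
TOKENS_PER_SEC_PER_INSTANCE = 50
USD_PER_H100_PER_YEAR = 30_000


def aggregate_tokens_per_sec() -> float:
    """Fleet-wide decode rate needed at peak."""
    return PEAK_CONCURRENT_REQUESTS * AVG_OUTPUT_TOKENS / GENERATION_SECONDS


def instances_needed(tokens_per_sec: float) -> int:
    """Instances required to sustain the given aggregate decode rate."""
    return math.ceil(tokens_per_sec / TOKENS_PER_SEC_PER_INSTANCE)


def kv_bytes_per_token(kv_width: int, num_layers: int, bytes_per_value: int = 2) -> int:
    """2 (K and V) x width x layers x bytes per value."""
    return 2 * kv_width * num_layers * bytes_per_value


exact_rate = aggregate_tokens_per_sec()         # about 166,667 tokens/sec
fleet = instances_needed(150_000)               # using the rounded ~150 K figure
fleet_cost_usd = fleet * USD_PER_H100_PER_YEAR  # ASSUMPTION: one H100 per instance

# ASSUMPTION: width 256 and 80 layers are illustrative, not real model dimensions.
per_token = kv_bytes_per_token(kv_width=256, num_layers=80)
kv_for_100k_context_gb = 100_000 * per_token / 1e9

# 1 KB per message x 100 messages per user x 200 M users.
history_storage_tb = 1_000 * 100 * 200_000_000 / 1e12
```

Changing `GENERATION_SECONDS` or `TOKENS_PER_SEC_PER_INSTANCE` shows how sensitive the fleet size is to decode speed, which is the lever most of the later sections pull on.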

Step 3 — System Interface#

POST /v1/chat/completions
  Body: {
    model: "gpt-x-large" | ...,
    messages: [{role, content}, ...],
    max_tokens: int,
    stream: bool,
    temperature: float,
    ...
  }
  If stream=false: synchronous response with full output.
  If stream=true: Server-Sent Events with one token / chunk per event.

POST /v1/conversations               (server-side state, returns conversation_id)
POST /v1/conversations/:id/messages
GET  /v1/models

The streaming protocol is the user experience — first-token latency is more important than total latency because users start reading the moment text appears.

Step 4 — High-Level Design#

                              safety pre-filter (input toxicity, prompt-injection signals)
                                                  │
client (SSE) ── LB ── API gateway ── auth ────────┴── conversation store (Postgres / KV)
                            │
                            ▼
         request router (model + KV-cache aware)
                            │
                            ▼
┌─────── GPU inference fleet (sharded by model) ─────────┐
│                                                        │
│  • prefill pool (large compute, short bursts)          │
│  • decode pool (small compute, long-running KV state)  │
│  • KV cache memory tier (shared, content-addressed)    │
│                                                        │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
         safety post-filter (output classifier)
                            │
                            ▼ stream back to client

The inference fleet is the soul of the system. Two distinct phases of LLM inference need different scheduling:

  • Prefill: process the entire input prompt at once (GPU-compute-bound; tens to a few hundred milliseconds for typical contexts, e.g. ~200 ms for a 2K-token cache miss).
  • Decode: generate one token at a time, attending to the growing KV cache (GPU-memory-bandwidth-bound, ~20-50 ms per token).

Many modern systems split these into different pools.

Step 5 — Data Model#

Conversations:

table conversations
  conv_id     uuid       PK
  user_id     uuid
  model       string
  created_at  timestamp
  updated_at  timestamp

table messages
  conv_id     uuid       PK
  msg_id      timeuuid   CK
  role        enum(user, assistant, system, tool)
  content     text
  tokens_in   int
  tokens_out  int
  cost_cents  decimal

KV cache index (in-memory across the inference fleet):

key: hash(conversation_prefix_tokens) → kv_cache_handle on GPU_node_N

The cache is content-addressed by token-prefix hash. When request R2 starts with the same prefix as R1 (very common in multi-turn chat), R2 reuses R1’s prefix KV — saving the prefill cost.
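
One way to make prefix lookup cheap is to hash block-aligned prefixes rather than every possible prefix. The sketch below is a minimal in-process version of that idea, not a description of any particular serving stack. `PrefixIndex`, `BLOCK`, and the node name are hypothetical, and the block size is tiny only to keep the example readable.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK = 4  # tokens per hashed block; ASSUMPTION: real systems would use larger blocks


def prefix_hashes(tokens: List[int], block: int = BLOCK) -> List[str]:
    """Digest of every block-aligned prefix: tokens[:block], tokens[:2*block], ..."""
    digests = []
    running = hashlib.sha256()
    for end in range(block, len(tokens) + 1, block):
        for tok in tokens[end - block:end]:
            running.update(tok.to_bytes(4, "little"))
        digests.append(running.hexdigest())
    return digests


class PrefixIndex:
    """Maps prefix digest -> (node holding that KV prefix, prefix length in tokens)."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, Tuple[str, int]] = {}

    def register(self, tokens: List[int], node: str) -> None:
        for i, digest in enumerate(prefix_hashes(tokens), start=1):
            self._by_hash[digest] = (node, i * BLOCK)

    def longest_prefix(self, tokens: List[int]) -> Optional[Tuple[str, int]]:
        best = None
        for digest in prefix_hashes(tokens):
            hit = self._by_hash.get(digest)
            if hit is None:
                break  # a longer prefix cannot match once a shorter one misses
            best = hit
        return best


index = PrefixIndex()
turn_1 = list(range(10))                # R1: ten prompt tokens
index.register(turn_1, "gpu-node-3")    # hypothetical node name
turn_2 = turn_1 + [99, 100, 101]        # R2 extends R1, as in multi-turn chat
hit = index.longest_prefix(turn_2)      # ("gpu-node-3", 8)
```

Here R2 reuses the first 8 tokens of R1's KV state and only the remaining 5 need prefill. Tokens 9 and 10 of R1 are not reusable because they did not fill a whole block, which is the price of hashing at block granularity.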

Step 6 — Detailed Design#

Streaming and SSE#

The server sends data: { "delta": "token" } events as fast as decode emits them. The client’s read rate provides the backpressure: if the client falls behind, the server either buffers a few hundred tokens or stalls decode for that request.

Network frame: each event is ~30 B (token text + JSON wrapping). At 50 tokens/sec, that’s 1.5 KB/sec of egress — trivial.
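
The wire format is small enough to show in full. The sketch below frames tokens as SSE events in the `{"delta": ...}` shape described above and reassembles them on the client side. The `[DONE]` end marker is an assumption borrowed from common practice and is not part of the design above; `sse_events` and `read_deltas` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator

END_MARKER = "[DONE]"  # ASSUMPTION: end-of-stream sentinel, not specified in the text


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Server side: one SSE event per token, each terminated by a blank line."""
    for tok in tokens:
        yield f"data: {json.dumps({'delta': tok})}\n\n"
    yield f"data: {END_MARKER}\n\n"


def read_deltas(stream: str) -> str:
    """Client side: concatenate deltas until the end marker."""
    parts = []
    for event in stream.split("\n\n"):
        if not event.startswith("data: "):
            continue
        payload = event[len("data: "):]
        if payload == END_MARKER:
            break
        parts.append(json.loads(payload)["delta"])
    return "".join(parts)


wire = "".join(sse_events(["Hel", "lo", "!"]))
text = read_deltas(wire)
first_event_bytes = len(next(sse_events(["Hel"])).encode("utf-8"))
```

A three-character token produces an event in the same range as the ~30 B estimate, so the egress figure holds for typical token lengths.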

Prefill / decode separation#

Request R arrives:
  1. Hit conversation store; assemble messages array; tokenize to [t0..tN].
  2. KV-cache lookup: does any GPU node hold a prefix of [t0..tN]?
  3. If yes (cache hit for first K tokens):
       route to that GPU node; prefill only [tK..tN].
  4. If no:
       prefill on prefill pool; KV state created; pass handle to decode pool.
  5. Decode pool generates tokens one at a time, attending to the KV state.
  6. As each token emits, stream back via SSE.

Prefill is compute-heavy and short — a single GPU node can process many prefill jobs per second. Decode is memory-bandwidth-heavy and long-running (lasts until generation ends, can be many seconds). Different schedulers fit each.

Continuous batching#

Instead of waiting for fixed-size batches (the old way), decode runs in continuous batching mode:

At each decode step:
- Take all currently-active requests on this GPU node.
- Compute the next token for each (in one batched forward pass).
- Some requests finish; remove them.
- New requests slot into freed positions immediately.

This keeps the GPU at high utilization without making any request wait for a batch to fill. The first decode pass might batch 32 requests; by the time some finish, new ones have joined. The same batch slot can serve dozens of different requests in sequence.
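
The decode loop above can be simulated without a GPU to see why slots refill matter. In this sketch, `tokens_left` stands in for "until the model emits end-of-sequence", and one loop iteration stands in for one batched forward pass. `Request` and `run_continuous_batching` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List


@dataclass
class Request:
    req_id: str
    tokens_left: int  # stand-in for "tokens until end-of-sequence"; must be >= 1


def run_continuous_batching(pending: Deque[Request], max_slots: int) -> Dict[str, int]:
    """Return the decode step at which each request finished."""
    active: List[Request] = []
    finished_at: Dict[str, int] = {}
    step = 0
    while pending or active:
        # New requests slot into freed positions immediately.
        while pending and len(active) < max_slots:
            active.append(pending.popleft())
        step += 1
        # One batched forward pass: every active request gets its next token.
        for req in active:
            req.tokens_left -= 1
        # Some requests finish; remove them from the batch.
        still_running = []
        for req in active:
            if req.tokens_left == 0:
                finished_at[req.req_id] = step
            else:
                still_running.append(req)
        active = still_running
    return finished_at


queue = deque([Request("a", 3), Request("b", 1), Request("c", 2)])
finished = run_continuous_batching(queue, max_slots=2)
```

With two slots, request `c` takes over `b`'s slot at step 2 and finishes at step 3 alongside `a`. Under fixed-size batching it would have waited for the whole first batch to drain and finished at step 5.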

Routing and the cache-aware load balancer#

The router is not naive round-robin. For each request, it considers:

  • Which model is requested.
  • What’s the KV-cache prefix-hit ratio across nodes for this conversation?
  • Which nodes have GPU memory headroom for the conversation’s projected context length?

Route to the node maximizing (cache_hit + memory_fit). The result is affinity for a conversation to a node — keeps multi-turn chats cheap.
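
A literal reading of the scoring rule looks like the sketch below. How `cache_hit` and `memory_fit` are normalized is not specified above, so both are assumed here to be fractions in [0, 1]: the share of the prompt already cached, and the share of the projected KV footprint the node can hold. `NodeState` and `pick_node` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeState:
    name: str
    model: str
    cached_prefix_tokens: int  # tokens of this conversation already cached on the node
    free_kv_bytes: int


def pick_node(nodes: List[NodeState], model: str, prompt_tokens: int,
              projected_tokens: int, kv_bytes_per_token: int) -> Optional[NodeState]:
    """Return the node maximizing cache_hit + memory_fit, or None if no node serves the model."""
    need = projected_tokens * kv_bytes_per_token
    best, best_score = None, -1.0
    for node in nodes:
        if node.model != model:
            continue
        # ASSUMPTION: both terms are normalized to [0, 1] and weighted equally.
        cache_hit = min(node.cached_prefix_tokens, prompt_tokens) / prompt_tokens
        memory_fit = min(1.0, node.free_kv_bytes / need)
        score = cache_hit + memory_fit
        if score > best_score:
            best, best_score = node, score
    return best


nodes = [
    NodeState("A", "large", cached_prefix_tokens=1500, free_kv_bytes=1_000_000_000),
    NodeState("B", "large", cached_prefix_tokens=0, free_kv_bytes=2_000_000_000),
    NodeState("C", "large", cached_prefix_tokens=2000, free_kv_bytes=160_000_000),
    NodeState("D", "small", cached_prefix_tokens=2000, free_kv_bytes=2_000_000_000),
]
chosen = pick_node(nodes, "large", prompt_tokens=2000,
                   projected_tokens=4000, kv_bytes_per_token=80_000)
```

Node C has the full prefix cached but only half the memory the conversation is projected to need, so A wins with a partial hit and enough headroom. A production router would likely treat an insufficient fit as a hard exclusion rather than a lower score.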

Safety filtering#

  • Sync input filtering: run a small classifier on the prompt before inference and refuse if flagged. Adds latency, but blocks before any GPU is committed. Good for high-risk categories.
  • Streaming output filter: classify tokens as they’re emitted; halt generation and emit a refusal mid-stream if a problematic pattern emerges. Lower latency for clean prompts, at a minor risk of leaking a few bad tokens before the halt.

Most providers use both. Input pre-filtering catches obvious abuse; streaming output filtering catches model-emergent harms.
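
The streaming output filter can be sketched as a generator wrapped around the token stream. The classifier here is a keyword check standing in for a real model, and `REFUSAL`, `filtered_stream`, and the window size are all hypothetical.

```python
from typing import Callable, Iterable, Iterator

REFUSAL = "[response withheld by safety filter]"  # hypothetical refusal text


def filtered_stream(tokens: Iterable[str], is_unsafe: Callable[[str], bool],
                    window_chars: int = 200) -> Iterator[str]:
    """Pass tokens through until the recent output trips the classifier, then halt."""
    recent = ""
    for tok in tokens:
        candidate = (recent + tok)[-window_chars:]
        if is_unsafe(candidate):
            yield REFUSAL
            return  # stop consuming tokens, which is what halts generation upstream
        recent = candidate
        yield tok


def keyword_classifier(text: str) -> bool:
    # ASSUMPTION: a stand-in for a real output classifier.
    return "forbidden" in text


flagged = list(filtered_stream(["This ", "is ", "for", "bid", "den", " text"],
                               keyword_classifier))
clean = list(filtered_stream(["All ", "good"], keyword_classifier))
```

In the flagged run, `"for"` and `"bid"` reach the client before the pattern completes. That is the "few bad tokens" leak described above, and it is why the streaming filter complements the input pre-filter instead of replacing it.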

Multi-tenant scheduling and rate limiting#

Each user / API key has:

  • A QPS limit.
  • A tokens-per-minute limit.
  • A priority class (free / pro / enterprise).

The request router uses these to enforce a weighted fair share across the GPU fleet. A free user’s request waits in queue if the fleet is saturated; a paying user’s request gets routed to a less-loaded shard or jumps the queue.
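
Weighted fair share can be approximated with virtual finish times: each request is stamped with `start + cost / weight`, and the queue always serves the smallest stamp. The sketch below is a simplified, per-tier version of that idea. The weights are illustrative, and a real system would track state per user or API key as well as per tier.

```python
import heapq
from itertools import count
from typing import Dict, List, Tuple

TIER_WEIGHTS = {"free": 1, "pro": 4, "enterprise": 8}  # ASSUMPTION: illustrative weights


class WeightedFairQueue:
    def __init__(self, weights: Dict[str, int]) -> None:
        self._weights = weights
        self._last_finish = {tier: 0.0 for tier in weights}
        self._virtual_time = 0.0
        self._heap: List[Tuple[float, int, str]] = []
        self._seq = count()  # tie-breaker so equal stamps stay FIFO

    def enqueue(self, tier: str, request_id: str, cost_tokens: int) -> None:
        start = max(self._virtual_time, self._last_finish[tier])
        finish = start + cost_tokens / self._weights[tier]
        self._last_finish[tier] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), request_id))

    def dequeue(self) -> str:
        finish, _, request_id = heapq.heappop(self._heap)
        self._virtual_time = finish  # simplification: clock follows the request in service
        return request_id

    def __len__(self) -> int:
        return len(self._heap)


q = WeightedFairQueue(TIER_WEIGHTS)
for rid in ("f1", "f2", "f3"):
    q.enqueue("free", rid, cost_tokens=100)
q.enqueue("pro", "p1", cost_tokens=100)  # arrives last
order = [q.dequeue() for _ in range(len(q))]
```

The pro request arrives last but is served first, because its cost is divided by a larger weight. Free requests are delayed, not starved: their stamps keep advancing and they are served in arrival order.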

Cost accounting#

Per-request cost is approximately prefill_tokens × input_rate + output_tokens × output_rate. It is written to the message row (cost_cents) at request completion and aggregated for billing.
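
As a worked instance of the formula, the sketch below computes the cost in cents with `Decimal` so that billing sums do not accumulate float error. The rates are made-up numbers expressed per 1,000 tokens, and `request_cost_cents` is a hypothetical helper.

```python
from decimal import Decimal


def request_cost_cents(prefill_tokens: int, output_tokens: int,
                       input_rate_per_1k: Decimal,
                       output_rate_per_1k: Decimal) -> Decimal:
    """prefill_tokens x input_rate + output_tokens x output_rate, in cents."""
    cost = (Decimal(prefill_tokens) * input_rate_per_1k
            + Decimal(output_tokens) * output_rate_per_1k) / 1000
    return cost.quantize(Decimal("0.0001"))


# ASSUMPTION: hypothetical rates in cents per 1,000 tokens.
INPUT_RATE = Decimal("0.3")
OUTPUT_RATE = Decimal("1.5")

cost = request_cost_cents(prefill_tokens=2000, output_tokens=100,
                          input_rate_per_1k=INPUT_RATE,
                          output_rate_per_1k=OUTPUT_RATE)
```

For a 2K-token prompt and 100 output tokens at these rates the result is 0.75 cents. Billing `prefill_tokens` rather than total prompt tokens means a KV-cache hit lowers the provider's cost, and whether it also lowers the customer's price is a product decision.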

Latency budget (target first-token < 1 s p95)#

LB + TLS:                           30 ms
Auth + rate limit:                   5 ms
Conversation history fetch:         10 ms
Tokenize:                            5 ms
Cache lookup + route:                5 ms
Prefill (cache miss, 2K tokens):   200 ms
First decode step:                  40 ms
Token serialized to SSE:             5 ms
Network back:                       30 ms
total:                            ~330 ms p50, ~800 ms p95

A cache hit collapses prefill to near-zero, bringing p50 first-token latency down to ~130 ms (330 ms minus the 200 ms prefill): the difference between “instant” and “thinking”.
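
Keeping the budget as data makes it easy to check the total and to see what a cache hit removes. The sketch below uses the p50 figures from the table and assumes a cache hit reduces prefill to exactly zero, which is the best case.

```python
# p50 first-token budget in milliseconds, copied from the table above.
BUDGET_MS = {
    "lb_tls": 30,
    "auth_rate_limit": 5,
    "history_fetch": 10,
    "tokenize": 5,
    "cache_lookup_route": 5,
    "prefill_cache_miss_2k": 200,
    "first_decode_step": 40,
    "sse_serialize": 5,
    "network_back": 30,
}

TARGET_P95_MS = 1000

cache_miss_ms = sum(BUDGET_MS.values())
# ASSUMPTION: a full prefix hit removes the prefill stage entirely.
cache_hit_ms = cache_miss_ms - BUDGET_MS["prefill_cache_miss_2k"]
prefill_share = BUDGET_MS["prefill_cache_miss_2k"] / cache_miss_ms
```

Prefill is about 60% of the cache-miss path, and removing it leaves about 130 ms, the same order as the cache-hit figure quoted above. Every other stage combined is smaller than prefill alone, which is why cache-aware routing matters more than shaving the gateway.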

Long-context handling#

A 200K-token context needs 16+ GB of KV cache. Strategies:

  • Page the KV cache to host RAM when not active; bring back on resume. Costs latency (~100 ms per GB).
  • Compress the cache with techniques like attention-sink or KV quantization.
  • Truncate older context that hasn’t been referenced recently, under a policy that is visible to the user rather than silent.

A frontier model serving long contexts at scale is dominated by these memory-management trade-offs.

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: GPU memory. Each request consumes KV memory proportional to its context length. The number of concurrent requests per GPU is gpu_memory / per_request_kv. For a 200K-token context, this is single-digit concurrency per GPU — limiting throughput. Mitigation: paged KV (vLLM-style) and aggressive KV quantization.
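
The concurrency bound is a single division, sketched below with the ~80 KB per-token figure from the capacity estimate and an 80 GB H100. The `weights_bytes` parameter is an addition not in the formula above: it reserves memory for model weights and defaults to zero so the default matches the formula as written.

```python
H100_MEMORY_BYTES = 80 * 10**9   # 80 GB H100
KV_BYTES_PER_TOKEN = 80_000      # ~80 KB per token, from the capacity estimate


def max_concurrent_requests(context_tokens: int,
                            gpu_memory_bytes: int = H100_MEMORY_BYTES,
                            kv_bytes_per_token: int = KV_BYTES_PER_TOKEN,
                            weights_bytes: int = 0) -> int:
    """gpu_memory / per_request_kv, optionally after reserving room for weights."""
    kv_budget = gpu_memory_bytes - weights_bytes
    per_request_kv = context_tokens * kv_bytes_per_token
    return max(0, kv_budget // per_request_kv)


long_context = max_concurrent_requests(200_000)  # 16 GB of KV per request
typical = max_concurrent_requests(8_000)         # 0.64 GB of KV per request
# ASSUMPTION: 40 GB of weights resident on the same GPU, for illustration only.
long_with_weights = max_concurrent_requests(200_000, weights_bytes=40 * 10**9)
```

A 200K-token context allows 5 concurrent requests per GPU against 125 at 8K tokens, and reserving memory for weights drops the long-context case to 2. Halving KV size through quantization doubles each of these numbers, which is why it appears in the mitigation list.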

Bottleneck #2: tail latency from preemption. If a GPU is busy with a long-running decode and a high-priority request lands, preempting is costly (state save/restore). Most systems use per-priority pools — premium requests have reserved capacity, eliminating the preemption case.

Bottleneck #3: cache eviction churn. A KV cache holding millions of partial conversations needs eviction. Naive LRU misses the recently-active conversations in a long-tail of users. A hybrid eviction (recency + size + per-user fairness) is the production shape.
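
A hybrid policy can be expressed as a single eviction score. The sketch below combines the three signals named above with equal weights. The normalization and the weights are assumptions for illustration, and `CacheEntry` and `pick_victim` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CacheEntry:
    key: str
    user_id: str
    kv_bytes: int
    last_used_s: float


def pick_victim(entries: List[CacheEntry], now_s: float, w_age: float = 1.0,
                w_size: float = 1.0, w_user: float = 1.0) -> CacheEntry:
    """Evict the entry with the highest combined age, size, and per-user share."""
    total_bytes = sum(e.kv_bytes for e in entries)
    bytes_by_user: Dict[str, int] = {}
    for e in entries:
        bytes_by_user[e.user_id] = bytes_by_user.get(e.user_id, 0) + e.kv_bytes
    max_age = max(now_s - e.last_used_s for e in entries) or 1.0

    def score(e: CacheEntry) -> float:
        age = (now_s - e.last_used_s) / max_age            # recency
        size = e.kv_bytes / total_bytes                    # memory reclaimed
        user_share = bytes_by_user[e.user_id] / total_bytes  # fairness across users
        return w_age * age + w_size * size + w_user * user_share

    return max(entries, key=score)


GB = 10**9
entries = [
    CacheEntry("a", "u1", 1 * GB, last_used_s=0.0),
    CacheEntry("b", "u2", 8 * GB, last_used_s=90.0),
    CacheEntry("c", "u2", 8 * GB, last_used_s=95.0),
]
hybrid_victim = pick_victim(entries, now_s=100.0)
lru_victim = pick_victim(entries, now_s=100.0, w_size=0.0, w_user=0.0)
```

Pure LRU (size and fairness weights set to zero) evicts the small, old entry `a` and frees 1 GB. The hybrid score evicts `b`, the older of two large entries held by one user, and frees 8 GB while leaving the light user's conversation warm.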

Alternative I’d push back on: synchronous, non-streaming responses for chat. The user perceives latency very differently when they see tokens appearing vs waiting for the full response. Streaming is not just a UX preference — it’s a 5×+ perceived-latency improvement.

What breaks first at 10× scale: GPU fleet capex. Inference cost is the bottleneck of every LLM provider, full stop. Solutions are mostly model-side (smaller / sparser / quantized models) rather than systems-side; the systems lever pulls are about utilization (continuous batching), cache reuse, and routing efficiency.

Companies this resembles#

OpenAI ChatGPT, Anthropic Claude, Google Gemini, Meta AI, Mistral platforms, xAI Grok, and the open-source serving stack (vLLM, TGI, TensorRT-LLM).
