Model Optimization for Deployment — Gen AI

Summary#

A frontier model out of training is too expensive to serve at scale. Inference cost — measured in dollars per million tokens, latency at p50/p99, and concurrent users per accelerator — is what determines whether a product makes money or burns it. Model optimization for deployment is the toolbox that closes the gap between the trained model and what your unit economics can sustain.

The levers fall into three buckets. Model-side optimizations shrink or simplify the weights: quantization (fewer bits per parameter), distillation (smaller student model trained to mimic a larger teacher), pruning (removing structurally unimportant weights). Runtime optimizations improve how each token is produced: KV-cache reuse, paged attention, continuous batching, FlashAttention-class kernels. Decoding-side optimizations reduce the number of sequential forward passes of the large model needed per output token: speculative decoding, lookahead decoding, multi-token prediction. Most production stacks combine all three.

Why it matters#

The cost of running an LLM at a million-user scale is dominated by GPU-hours, which scale roughly linearly with model size, sequence length, and request volume. A 2x reduction in cost-per-token at fixed quality is the difference between a sustainable product and a money-losing one. Every frontier lab and every serious inference platform — vLLM, TensorRT-LLM, SGLang, TGI — exists to wring more tokens per second per dollar out of the same model.

For application engineers, this matters because the choices upstream cascade. Picking a 70B model when a quantized 13B distillate would do can leave your serving cost on the order of 5x higher (roughly in proportion to parameter count) and your latency on the order of 3x worse. Picking a runtime without continuous batching can halve your hardware utilization. Knowing the optimization landscape lets you build a stack that doesn’t get out-economised the moment traffic grows.

How it works#

Quantization: fewer bits per weight#

Models are typically trained in bfloat16 (16 bits per parameter). At inference you can often quantize to 8 bits (INT8, FP8), 4 bits (INT4, NF4, usually produced with methods such as GPTQ or AWQ), or even lower with minor quality loss. Quantization reduces the memory footprint linearly with bit width, and it reduces the bytes read from memory per token, which matters because memory bandwidth is the dominant bottleneck during decode on modern GPUs.

Weight-only quantization (GPTQ, AWQ) quantizes the weights but keeps activations in fp16/bfloat16. Quality usually holds up well down to 4 bits; the GPU dequantizes on the fly inside the matmul kernel. Weight-and-activation quantization (SmoothQuant, FP8) pushes further, because the matmuls themselves run in low precision, but it is harder to keep stable since activations contain outliers. Quantization-aware training (QAT) retrains the model with simulated quantization in the forward pass, so the weights learn to tolerate rounding error; it gives better quality than post-training quantization (PTQ) at the cost of more compute.
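To make the "quantize once, dequantize on the fly" idea concrete, here is a minimal sketch of symmetric round-to-nearest quantization for a single group of weights. It is deliberately the simplest possible scheme, not GPTQ or AWQ (both choose the rounding or scaling more carefully); the function names and the sample weights are made up for illustration.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns integer codes in [-qmax, qmax] and the shared float scale.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """What the kernel does on the fly: integer code times the group scale."""
    return [c * scale for c in codes]


# Hypothetical weights for one group; real groups are often 64 or 128 wide.
weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.48, -0.02, 0.26]
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, round(scale, 4), round(max_err, 4))
```

The rounding error per weight is bounded by half the scale, which is why smaller groups (each with its own scale) usually recover quality at the cost of storing more scales.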

In 2026, 4-bit weight-only quantization is the default for serving open-weights LLMs — a 70B model fits on a single 80GB GPU at INT4 instead of needing two GPUs at bfloat16.
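The single-GPU claim follows from simple arithmetic. The sketch below counts weight bytes only; it ignores quantization scales, the KV cache, and activations, so treat the numbers as lower bounds. The function name is illustrative.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("bfloat16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{name:>9}: {weight_footprint_gb(70e9, bits):6.1f} GB")
```

At 16 bits a 70B model needs about 140 GB for weights alone, which exceeds one 80GB GPU; at 4 bits it needs about 35 GB, leaving room for the KV cache.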

Distillation: smaller student, similar behaviour#

Distillation trains a smaller “student” model to imitate a larger “teacher”. The student sees the teacher’s full output distribution at every token (soft labels), not just the argmax, which carries far more information than a hard label. Add some supervised data on top and the student often reaches 80–95% of the teacher’s quality at a fraction of the cost.
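A common way to use the teacher’s soft labels is a KL-divergence loss between temperature-softened teacher and student distributions. The sketch below computes that loss for a single token position in plain Python; the temperature value and the toy logits are assumptions, and a real training loop would compute this over batches with an autodiff framework and usually mix in a hard-label loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions for one token.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 2.5, 0.5, -1.0]          # toy logits over a 4-token vocabulary
close_student = [3.8, 2.6, 0.4, -0.9]
far_student = [0.0, 0.0, 4.0, 0.0]
print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The soft target tells the student not only that token 0 is best, but also that token 1 is a plausible runner-up, which a hard label would discard.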

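Production examples: Gemini Nano, which Google describes as distilled from larger Gemini models, and the Phi-3 family, which relies heavily on synthetic data generated by larger models (a looser form of distillation). Smaller tiers of other families, such as Claude Haiku relative to Claude Sonnet/Opus, and distilled coding models derived from DeepSeek-Coder-V2, are commonly cited in the same breath, although vendors rarely publish the details. The pattern is reliable enough that frontier labs increasingly ship small models alongside their flagships.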

Pruning: remove what doesn’t matter#

Pruning identifies weights, attention heads, or layers that contribute little to the output and removes them. Unstructured pruning zeros out individual weights (good compression ratio, hard for GPUs to exploit). Structured pruning removes whole rows, columns, or layers (worse compression, much better hardware speedup). 2:4 sparsity (in every contiguous group of 4 weights, at least 2 are zero) is a sweet spot: Nvidia’s sparse tensor cores accelerate it natively, for up to ~2x matmul throughput at minor quality loss.
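The 2:4 pattern is easy to state in code. This sketch uses plain magnitude pruning (keep the two largest-magnitude weights in each group of four), which is the simplest selection rule and only an assumption here; production recipes typically follow pruning with fine-tuning to recover quality.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in every contiguous group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
print(prune_2_of_4(row))  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```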

Pruning in 2026 is less popular than quantization for LLMs: most of the easy savings are already captured by 4-bit quantization, and pruning is harder to combine with the dynamic shapes of attention. It’s still common in vision and audio models.

KV-cache reuse and paged attention#

During autoregressive generation, every new token attends over all previous tokens. Naively, each step recomputes the K and V projections for the entire prefix, so the total work to generate a sequence grows quadratically with its length before attention itself is even counted. The fix is to cache the K and V projections for every token already processed (the KV cache) and compute new K and V only for the new token.
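The following toy single-head attention shows the saving. It is a sketch with made-up 2x2 projection matrices and four hand-written token embeddings, not a real model; the point is that the cached path produces the same outputs as recomputing from scratch while performing far fewer K/V projections.

```python
import math


def project(x, w):
    """Row vector x times matrix w (given as a list of rows)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Softmax attention of one query over all keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


class KVCache:
    """Stores K and V for every token seen so far; one projection per step."""

    def __init__(self):
        self.keys, self.values = [], []
        self.kv_projections = 0

    def step(self, x, wq, wk, wv):
        self.keys.append(project(x, wk))
        self.values.append(project(x, wv))
        self.kv_projections += 1
        return attend(project(x, wq), self.keys, self.values)


def step_without_cache(xs, wq, wk, wv):
    """Recompute K and V for the whole prefix; returns output and projection count."""
    keys = [project(x, wk) for x in xs]
    values = [project(x, wv) for x in xs]
    return attend(project(xs[-1], wq), keys, values), len(xs)


# Hypothetical toy weights and token embeddings.
WQ = [[0.5, -0.2], [0.1, 0.8]]
WK = [[0.3, 0.7], [-0.6, 0.2]]
WV = [[1.0, 0.0], [0.5, -1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.3]]

cache = KVCache()
naive_projections = 0
outputs_match = True
for t in range(len(tokens)):
    cached_out = cache.step(tokens[t], WQ, WK, WV)
    naive_out, n = step_without_cache(tokens[: t + 1], WQ, WK, WV)
    naive_projections += n
    outputs_match &= all(abs(a - b) < 1e-12 for a, b in zip(cached_out, naive_out))

print(outputs_match, cache.kv_projections, naive_projections)  # True 4 10
```

For n tokens the cached path does n K/V projections and the naive path does n(n+1)/2, which is the quadratic term the cache removes.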

The KV cache dominates inference memory for long contexts. At 70B parameters with a 32K context, the KV cache for a batch of concurrent requests can be larger than the model weights. Paged attention (vLLM) manages the cache in fixed-size pages like virtual memory, which largely eliminates fragmentation and allows efficient multi-tenant serving. Grouped-query attention (GQA) shares each K/V head across a group of query heads, typically shrinking the cache by ~4–8x with minor quality cost; multi-query attention (MQA) goes further and keeps a single K/V head. GQA is close to universal in modern open-weights models.
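A back-of-the-envelope calculation shows how the cache grows. The configuration below (80 layers, 64 query heads, head dimension 128, 8 KV heads under GQA, 16-bit cache) is an illustrative assumption resembling a 70B-class model, not the published spec of any particular one.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes for K and V across all layers, for batch_size sequences of seq_len."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 2 ** 30
CONTEXT = 32 * 1024

# Assumed 70B-class shape: every query head has its own K/V head (MHA) ...
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=CONTEXT)
# ... versus 8 shared K/V heads (GQA).
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=CONTEXT)

print(f"MHA, 1 sequence:   {mha / GIB:.0f} GiB")
print(f"GQA, 1 sequence:   {gqa / GIB:.0f} GiB")
print(f"GQA, 16 sequences: {16 * gqa / GIB:.0f} GiB")
```

Under these assumptions one 32K sequence costs 80 GiB without GQA and 10 GiB with it, and a batch of 16 such sequences under GQA (160 GiB) already exceeds the roughly 140 GB of bfloat16 weights.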

Speculative decoding: predict multiple tokens at once#

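Autoregressive decoding produces one token per forward pass, a hard ceiling on speed. Speculative decoding uses a small, fast “draft” model to propose k tokens at a time, then verifies them in a single forward pass of the full target model. If the target accepts the first j draft tokens, you get j + 1 tokens for one target forward pass: the j accepted tokens plus one token the target supplies itself, either as a correction at the first rejection or as a bonus after all k are accepted. With the standard rejection-sampling acceptance rule, the output distribution is exactly the target model’s, so quality is preserved.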
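The accept/reject rule can be sketched with toy distributions standing in for the two models. Here `draft_dist` and `target_dist` are hypothetical callables that return a probability list over a three-token vocabulary; a real system would get the k + 1 target distributions from one batched forward pass rather than k + 1 separate calls.

```python
import random


def speculative_round(prefix, draft_dist, target_dist, k, rng):
    """One draft-then-verify round; returns between 1 and k + 1 new tokens."""
    # 1. The draft proposes k tokens autoregressively.
    proposed, draft_probs = [], []
    for _ in range(k):
        q = draft_dist(prefix + proposed)
        proposed.append(rng.choices(range(len(q)), weights=q)[0])
        draft_probs.append(q)

    # 2. The target scores every position (a single forward pass in practice).
    target_probs = [target_dist(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept left to right with probability min(1, p/q).
    out = []
    for i, tok in enumerate(proposed):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the normalized residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out

    # All k accepted: take one bonus token from the target's next distribution.
    bonus = target_probs[k]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out


def target_dist(context):
    return [0.6, 0.3, 0.1]     # toy, context-independent target


def draft_dist(context):
    return [0.3, 0.3, 0.4]     # toy draft that disagrees with the target


rng = random.Random(0)
trials = 20000
counts = [0, 0, 0]
for _ in range(trials):
    counts[speculative_round([], draft_dist, target_dist, k=3, rng=rng)[0]] += 1
freqs = [c / trials for c in counts]
print([round(f, 3) for f in freqs])   # close to the target's [0.6, 0.3, 0.1]
```

Even though the draft is badly mismatched, the emitted tokens follow the target distribution; the mismatch only lowers how many tokens each round yields.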

Typical reported speedup is 2–3x on language workloads, more on highly predictable tasks like code. Self-speculative methods (Medusa, EAGLE) train lightweight draft heads attached to the target model instead of using a separate draft model, which simplifies operations (one model to deploy) at similar speedup. In 2026, speculative decoding is standard on every serious inference runtime.

Variants and trade-offs#

Quantization. Strengths: applies to any model post-hoc, needs no retraining (for PTQ), and preserves the model’s architecture and capabilities. Weaknesses: quality degrades at very low bit-widths, and some tasks (math, code) are more sensitive than others.

Distillation. Strengths: produces a smaller architecture with similar capabilities, gives the biggest absolute speedup, and lets you keep near-frontier quality at edge-device cost. Weaknesses: requires a training run, training-set design is non-trivial, and the student inherits the teacher’s biases.

Other axes:

Static vs dynamic shapes. Speculative decoding, paged attention, and continuous batching all introduce dynamic shapes that complicate compilation. TensorRT-LLM and torch.compile both work hard to make this efficient; some older runtimes (vanilla ONNX Runtime) struggle.
Throughput vs latency. Continuous batching maximizes throughput (tokens/sec across all users), but the larger batches it sustains make each decode step slower, so per-request latency rises under load. Speculative decoding reduces per-request latency but uses more FLOPs per token. The right balance depends on whether you’re serving an interactive chat (latency-sensitive) or a batch summarisation job (throughput-sensitive).
Prefill vs decode. Prefill (processing the prompt) is compute-bound and parallel. Decode (generating output) is memory-bandwidth-bound and sequential. Most optimizations target one phase or the other — e.g., FlashAttention helps prefill, speculative decoding helps decode. A balanced stack does both.
Quantization for fine-tunes. After QLoRA fine-tuning, the LoRA delta is small and can be merged into the dequantized base weights, and the merged model can then be re-quantized for serving. This is how a frontier-quality 70B fine-tune ends up serving on a single GPU at INT4.
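The throughput argument for continuous batching can be seen in a small simulation. The sketch below counts decode steps only and assumes every step costs the same regardless of batch occupancy; it ignores prefill, arrival times, and memory limits, and the request lengths are invented.

```python
def static_batching_steps(output_lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Slots freed by finished requests are refilled before the next decode step."""
    queue = list(output_lengths)
    running = []                      # tokens still to generate per in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1                    # one decode step for every running request
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical mix of long and short generations (output tokens per request).
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```

With static batching the short requests leave their slots idle while the long one finishes; continuous batching admits the second long request after ten steps, so the whole workload completes in 110 steps instead of 200.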

Why 'just make it faster' isn't free

Every optimization above can carry a quality cost, even when the headline number is “lossless”. 4-bit quantization that looks lossless at the perplexity level often degrades the hardest tasks (multi-step math, structured output) by a few percentage points. Distilled students lose long-tail capability. Speculative decoding with exact rejection-sampling verification preserves the target distribution however poor the draft is (a bad draft only costs speed), but approximate acceptance rules, which some implementations use to accept more tokens, can shift the output distribution. The discipline is to evaluate every optimization on your own evals, not just on MMLU. Many teams have shipped an optimized model that looked fine on standard benchmarks and then watched their production metrics regress, because their users hit the corner cases the benchmarks don’t cover.

When this is asked in interviews#

This is the load-bearing senior question on AI-infrastructure, ML-platform, and inference-engineering loops. It also shows up on AI-product loops where the team has had to optimize a real serving stack. The interviewer wants to see that you can think about the inference stack end-to-end, not just name techniques.

What they’re checking:

Do you know the three buckets (model-side, runtime-side, decoding-side) and can you place specific techniques in each.
Can you reason about the bottleneck of a given workload — prefill-bound vs decode-bound, memory-bandwidth-bound vs compute-bound.
Have you actually deployed a model and seen quantization or batching change the unit economics — not just read about it.

Common follow-ups:

“Why does quantization help latency, not just memory?” Modern GPUs are memory-bandwidth-bound during decode at small and moderate batch sizes, so halving the bytes of weights read per token roughly halves decode latency, provided the kernels dequantize efficiently.
“When does speculative decoding not help?” When the draft model disagrees often (out-of-distribution tasks, hard prompts, high-temperature sampling), when the verifier is already cheap (small models, edge devices), or when the server is already compute-bound at large batch sizes and has no spare FLOPs for verification.
“How do you size hardware for a new model?” Sum the model weights, the KV cache for your target context length and concurrency, and activation memory, then add ~20% headroom. Then compute the throughput at your batch size and check it meets your tokens-per-second-per-GPU target.
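The sizing answer can be turned into a rough calculator. Everything numeric below is an assumption chosen for illustration: a 70B model at 4 bits, 320 KiB of KV cache per token, 8 concurrent 8K-token sequences, 2 GB of activations, and an 80 GB GPU with 2000 GB/s of memory bandwidth. The latency figure is a lower bound that counts only one full read of the weights per decode step; real steps also read the KV cache and pay kernel overheads.

```python
def size_deployment(num_params, bits_per_weight, kv_bytes_per_token, context_len,
                    concurrent_seqs, activation_gb, gpu_mem_gb, bandwidth_gb_s,
                    headroom=0.20):
    """Rough memory fit and bandwidth-bound decode estimate for one GPU."""
    weights_gb = num_params * bits_per_weight / 8 / 1e9
    kv_gb = kv_bytes_per_token * context_len * concurrent_seqs / 1e9
    required_gb = (weights_gb + kv_gb + activation_gb) * (1 + headroom)
    # Each decode step must stream all weights from memory at least once.
    min_step_ms = weights_gb / bandwidth_gb_s * 1000
    return {
        "weights_gb": weights_gb,
        "kv_gb": kv_gb,
        "required_gb": required_gb,
        "fits": required_gb <= gpu_mem_gb,
        "min_decode_step_ms": min_step_ms,
        # One token per sequence per step, so this is an upper bound.
        "max_tokens_per_s": concurrent_seqs / (min_step_ms / 1000),
    }


report = size_deployment(
    num_params=70e9, bits_per_weight=4,
    kv_bytes_per_token=327_680, context_len=8192, concurrent_seqs=8,
    activation_gb=2.0, gpu_mem_gb=80, bandwidth_gb_s=2000,
)
for key, value in report.items():
    print(f"{key:>20}: {value if isinstance(value, bool) else round(value, 2)}")
```

The same function also answers the quantization follow-up: doubling `bits_per_weight` doubles `min_decode_step_ms`, which is the bandwidth-bound argument in numeric form.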