Foundation Models
The lifecycle of a foundation model — pretraining, post-training, evaluation, optimization for deployment. The model-as-a-system view, not the architecture view.
If the Architectures topic asks "how is this model built?", this topic asks "how does a model become useful?" Pretraining produces a raw capability; post-training (instruction tuning, RLHF, DPO) makes it follow directions; evaluation tells you whether it actually does; optimization makes it cheap enough to ship.
This lifecycle is what separates a research artifact from a product. Most engineers will never train a frontier model from scratch — but every engineer who serves one needs to understand these stages, because each stage is a lever for fixing the model's failures.
Key concepts
- Pretraining = capability; post-training = behavior. The model knows things vs the model does what you ask
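- RLHF and DPO are preference-optimization tools: RLHF fits a reward model and optimizes against it, while DPO optimizes directly on preference pairs. Both are how you trade off helpfulness, harmlessness, and honesty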
- Evaluation is partly art — every benchmark over-fits; holistic eval matters more than any single number
- Quantization and distillation can be the difference between $10 and $0.10 per 1M tokens
- Scaling laws are real but local — they predict next-token loss, not capability cliffs
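To make the RLHF/DPO bullet above concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model. The function names, parameter names, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response.
    beta controls how far the policy may drift from the reference.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


# When the policy equals the reference, the loss is log(2): no preference learned yet.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Note that no explicit reward model appears: the reward is implicit in the log-probability ratio between policy and reference, which is the main practical difference from an RLHF pipeline.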
Reference template
// The foundation-model lifecycle
1. Data curation (what does the model see during pretraining?)
2. Pretraining (the expensive bit — typically a single large run)
3. Post-training (SFT → RLHF / DPO → safety tuning)
4. Evaluation (benchmarks + holistic + adversarial)
5. Optimization (quantization, distillation, KV-cache reuse)
6. Serving (latency, throughput, cost per token)
7. Continuous iteration (each new release rewinds and adjusts)
Adapt to your problem; the structure is the load-bearing part.
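As one illustration of step 5, the following is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an assumption-laden toy: real deployments typically quantize per channel or per group and use library kernels, and the function names and sample weights here are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus a single float scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.31, 1.27, -1.0]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte instead of the four used by float32, at the cost of a rounding error of at most `scale / 2` per weight. Whether that error is acceptable is a question for step 4, not for this step.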
Common pitfalls
- Treating fine-tuning as the answer to every problem — most problems are solved better with retrieval or prompting
- Underestimating data quality — model behavior is mostly the data, especially at the post-training stage
- Confusing capability with reliability — a 90% accurate model is a 0% deployable product without guardrails
- Skipping the eval step — "it looked good in my chat" is not a benchmark
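The last pitfall can be made quantitative. Below is a small sketch, assuming a simple pass/fail eval, that computes a Wilson score interval for an observed accuracy. The function name `wilson_interval` and the example counts are illustrative; the formula is the standard Wilson interval with z = 1.96 for roughly 95% coverage.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a pass rate of successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Ten chat turns that "looked good" 9 times: the interval is very wide,
# roughly 0.60 to 0.98.
small_lo, small_hi = wilson_interval(9, 10)

# The same 90% pass rate over 1000 graded examples: the interval is a few
# percentage points wide.
large_lo, large_hi = wilson_interval(900, 1000)
```

The point estimate is identical in both cases; only the larger sample supports a claim about how the model behaves in general.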
Related topics
Items (8)
- What Are Foundation Models?
Large, broadly-pretrained models that serve as starting points for many downstream tasks. The reusable substrate of modern AI.
Concept Foundational
- How Do Models Learn?
Gradient descent, backpropagation, loss functions, and the optimization loop. The engine under every neural network.
Concept Foundational
- Pretraining Paradigms
Causal vs masked vs contrastive vs span-corruption. The objective you pick determines what the model is good at.
Concept Intermediate
- Post-Training, Fine-Tuning, and Adaptation
Supervised fine-tuning, RLHF, DPO, LoRA, prompt-tuning. How a pretrained model becomes a product.
Concept Intermediate
- Model Optimization for Deployment
Quantization, distillation, pruning, KV-cache reuse, speculative decoding. The serving-cost levers that decide unit economics.
Concept Advanced
- Large Language Models at Scale
Scaling laws, compute budgets, emergent capabilities, and the cost shape that determines who can train frontier models.
Concept Advanced
- Evaluating Large Language Models
Perplexity, MMLU, HumanEval, helpfulness ratings, holistic evals. Why every benchmark is wrong and you still need them.
Concept Intermediate
- Multimodal Models
Text + image + audio in one model. CLIP, Flamingo, Gemini, GPT-4o: how cross-modal alignment actually works.
Concept Advanced