Post-Training, Fine-Tuning, and Adaptation

Supervised fine-tuning, RLHF, DPO, LoRA, prompt-tuning. How a pretrained model becomes a product.

Concept Intermediate
8 min read
fine-tuning rlhf dpo lora post-training

Summary#

A freshly-pretrained language model is a next-token predictor — capable but unhelpful. It will happily complete “How do I reset my password?” with a thousand more user questions instead of answering one. Post-training is the collection of techniques that turns this raw predictor into something a user can talk to: supervised fine-tuning (SFT) on instruction-response pairs, preference optimization (RLHF, DPO, IPO, KTO) to align with human preferences, safety fine-tuning to enforce refusal behaviour, and tool-use training so the model can call functions reliably.

Adaptation is the broader umbrella that also includes lightweight per-customer techniques: LoRA / QLoRA for parameter-efficient fine-tuning, prompt-tuning / prefix-tuning for the lightest-weight customisation, and continued pretraining for heavier domain shifts. Frontier labs run the full post-training stack; product teams typically run adaptation on top of an already-aligned base.

Why it matters#

Every chat product you use (ChatGPT, Claude, Gemini, Llama-Chat) is a base model plus a post-training stack. The pretraining objective is still the same next-token prediction it was three years ago; what changed is the post-training. RLHF (2022) turned GPT-3.5 into ChatGPT; DPO (2023) made the recipe reproducible without a reward model; refusal training and constitutional AI (2023–2024) made the models safer to deploy.

For engineers building on top, the question is rarely “should I pretrain?” (almost never) but “should I fine-tune, and if so, how?” Knowing the difference between full SFT, LoRA, and prompt-tuning — and when each beats prompting alone — is the load-bearing skill. The wrong choice burns weeks of GPU time and ships a model worse than the off-the-shelf version it replaced.

How it works#

Supervised fine-tuning (SFT): instruction-following#

After pretraining, the first post-training step is SFT on a curated dataset of (instruction, response) pairs. The loss is still next-token cross-entropy, but the data has shifted from “the open web” to “high-quality demonstrations of helpful behaviour”. A few hundred thousand pairs is usually enough; data quality matters far more than quantity.

SFT teaches the model the format and register of a helpful assistant. It also teaches it to follow templated instructions (“answer in JSON”, “be concise”, “explain step by step”). The model’s capabilities don’t grow much during SFT — capabilities come from pretraining — but its behaviour shifts dramatically.
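
As a rough sketch of what "next-token cross-entropy on demonstrations" means, the function below averages the negative log-likelihood over the response tokens of one (instruction, response) pair. Masking out the instruction tokens is an assumption made for this example (a common choice, not something stated above), and the probabilities are made-up illustrative values rather than real model outputs.

```python
import math

def sft_loss(token_logprobs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log p(token_t | tokens before t) for each position.
    is_response:    True where the token belongs to the response. Masking the
                    instruction tokens is an assumption of this sketch.
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)

# Hypothetical per-token probabilities for one (instruction, response) pair.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]  # first two tokens are the instruction
loss = sft_loss([math.log(p) for p in probs], mask)
print(f"SFT loss on response tokens: {loss:.4f}")
```

Lower loss means the model assigns higher probability to the demonstrated response; the training loop (not shown) would backpropagate this value through the model.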

Preference optimization: RLHF, DPO, and successors#

SFT alone produces a model that’s helpful but not always preferred. The next step is preference optimization: train on pairs of responses where humans (or another model) said “A is better than B”.

RLHF (Reinforcement Learning from Human Feedback) is the original recipe (InstructGPT, 2022): train a reward model on preferences, then fine-tune the policy with PPO to maximize the reward model’s score, with a KL penalty keeping the policy close to the SFT initialization. RLHF works but is operationally complex — you have two models in flight (reward and policy), training is unstable, and PPO has many hyperparameters.
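
The KL penalty described above can be sketched as a shaped reward: the reward model's score minus a penalty proportional to how much more likely the policy makes the sampled response than the SFT reference does. This is a minimal illustration under simplifying assumptions (one sequence-level penalty, a hypothetical `beta` default), not the exact formulation of any particular PPO implementation.

```python
def kl_penalised_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a sequence-level KL estimate.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the sampled
    response under the current policy and the frozen SFT reference.
    beta: penalty strength (hypothetical default chosen for illustration).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must align token by token")
    # Sum of per-token log-ratios: a single-sample estimate of the KL term.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Illustrative numbers only: the policy has drifted from the reference.
shaped = kl_penalised_reward(2.0, [-1.0, -0.5], [-1.5, -1.0], beta=0.1)
print(f"shaped reward: {shaped:.2f}")
```

When the policy still matches the reference, the penalty is zero and the shaped reward equals the reward model's score; the further the policy drifts, the more of that score is taken back.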

DPO (Direct Preference Optimization, 2023) eliminated the reward model. The DPO loss is derived in closed form from the same KL-regularized objective: increase the log-probability of preferred responses relative to a frozen reference model and decrease it for rejected ones, with a coefficient β controlling how strongly the policy is held close to the reference. Same goal, no reward model, more stable.
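
For a single preference pair, the DPO loss can be sketched in a few lines. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and under the frozen reference; the numbers and the `beta` default below are illustrative assumptions, and a real trainer would compute this in batches with automatic differentiation.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_ratio = policy_chosen_lp - ref_chosen_lp        # log pi/pi_ref, chosen
    rejected_ratio = policy_rejected_lp - ref_rejected_lp  # log pi/pi_ref, rejected
    z = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Illustrative log-probabilities (not real model outputs).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to reference
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)   # chosen up, rejected down
worse = dpo_loss(-11.0, -11.0, -10.0, -12.0)     # moved the wrong way
print(neutral, improved, worse)
```

The loss starts at log 2 when the policy equals the reference and falls as the policy widens the gap between chosen and rejected responses relative to the reference.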

Variants (IPO, KTO, ORPO, SimPO) tweak the DPO loss further — different KL penalties, ability to learn from unpaired data, removal of the reference model. Most production stacks use DPO or one of its descendants in 2026.

Parameter-efficient fine-tuning (PEFT): LoRA and friends#

Full fine-tuning updates every parameter of the model — expensive in compute, memory, and storage (you end up with a full new model per task). PEFT methods update only a small fraction of parameters or add a small adapter.

LoRA (Low-Rank Adaptation) is the dominant variant: for each linear layer you want to adapt, learn a low-rank update ΔW = B · A where A and B are small matrices (rank typically 4 to 64). The original weights stay frozen. A LoRA adapter for a 70B-parameter model might be only ~500MB instead of ~140GB, and at inference you can swap adapters per request.
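
To make the low-rank update concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer and its trainable-parameter count. The tiny matrices, the `scale` argument, and the 8192 x 8192 layer size are assumptions chosen for illustration; real implementations use tensor libraries and apply this to many layers at once.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # rank-r detour through the small matrices
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

# Toy rank-1 example: W is 2x2, A is 1x2, B is 2x1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.0]]
y = lora_forward(W, A, B, [2.0, 3.0])

# Hypothetical 8192 x 8192 projection adapted at rank 16.
full = 8192 * 8192
adapter = lora_param_count(8192, 8192, 16)
print(y, adapter, adapter / full)
```

In this hypothetical layer the adapter trains 262,144 parameters against 67,108,864 in the full matrix, which is 1/256 of the weights; that ratio is where the small adapter files come from.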

QLoRA combines LoRA with 4-bit quantization of the base model: the frozen base is quantized down to NF4, the LoRA adapters stay in bfloat16, and you can fine-tune a 70B model on a single 48GB GPU. Prompt-tuning and prefix-tuning go even lighter — they learn a small set of “virtual token” embeddings that get prepended to every input, no model weight changes at all. Cheap, fast, fragile.
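
The memory saving from quantizing the base can be checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only; it deliberately ignores quantization constants, activations, adapter weights, and optimizer state, so treat the results as lower bounds rather than a sizing guide.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

bf16_gb = weight_memory_gb(70e9, 16)  # 70B parameters at 16 bits each
nf4_gb = weight_memory_gb(70e9, 4)    # the same parameters at 4 bits each
print(f"bf16: {bf16_gb:.0f} GB, 4-bit: {nf4_gb:.0f} GB")
```

Going from 16 to 4 bits cuts the frozen base from about 140 GB to about 35 GB, which is what leaves room on a single large GPU for the adapters and training overhead.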

Continued pretraining and domain adaptation#

Sometimes the gap between the base model and your target domain is too large for SFT to bridge — e.g., legal contracts in a specific jurisdiction, code in a niche language, medical notes. Continued pretraining: take an open-weights checkpoint and run more next-token-prediction on a domain-specific corpus before SFT. Cheaper than training from scratch, more powerful than SFT alone.

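The order is important: continued pretraining first (instil domain knowledge), then SFT (instil task format), then preference optimization (instil preferred behaviour). Running a stage out of order risks undoing the earlier work: for example, continued pretraining on raw domain text after SFT tends to erode the instruction-following format that SFT instilled.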

Variants and trade-offs#

RLHF (PPO-based) — battle-tested, what the original ChatGPT used. Operationally complex: separate reward model, PPO instability, careful KL tuning. Still preferred at frontier labs for the highest-quality post-training.
DPO and successors — closed-form loss, no separate reward model, much simpler infrastructure. Slightly less expressive than RLHF on the hardest preference signals but close enough that it’s the default for most open-weights and product fine-tunes.

Other axes:

  • Full fine-tune vs LoRA. Full fine-tune wins on quality when you have ample compute and a large, high-quality dataset. LoRA wins on cost, deployment simplicity (one base + many adapters), and forgetting (the base capabilities are protected). For <100K examples, LoRA is almost always the right call.
  • Prompt-tuning / prefix-tuning is the lightest weight and most fragile. Useful when you have very little data and just need to nudge style or format. Don’t expect it to teach the model anything new.
  • Constitutional AI (Anthropic’s approach) replaces some human-preference data with model-generated critiques against a written constitution. Scales cheaper than human feedback, requires careful constitution authorship.
  • Reasoning fine-tuning (the o1 / DeepSeek-R1 line) adds a chain-of-thought generation step to the post-training mixture, often using RL on verifiable reasoning tasks (math, code) rather than human preferences. The frontier post-training pattern in 2026.
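
What makes a reasoning task "verifiable" is that the reward can be computed by a program instead of a human rater. The sketch below shows one hypothetical form of such a reward for math-style problems; the `Answer:` marker and exact-match comparison are assumptions made for this example, not a description of how any particular lab scores outputs.

```python
def verifiable_reward(model_output: str, expected_answer: str) -> float:
    """Return 1.0 if the text after the last 'Answer:' matches the reference."""
    marker = "Answer:"  # hypothetical output convention for this sketch
    idx = model_output.rfind(marker)
    if idx == -1:
        return 0.0  # no final answer given, so nothing to verify
    answer = model_output[idx + len(marker):].strip()
    return 1.0 if answer == expected_answer.strip() else 0.0

good = verifiable_reward("6 * 7 = 42\nAnswer: 42", "42")
bad = verifiable_reward("6 * 7 = 41\nAnswer: 41", "42")
missing = verifiable_reward("I think it is 42", "42")
print(good, bad, missing)
```

The chain of thought before the marker is free-form; only the checkable final answer earns reward, which is what lets RL run on these tasks without preference labels.
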

The 'fine-tune destroyed my model' failure mode

A common, painful failure: someone runs SFT on ~1000 domain examples and the resulting model is worse than the base on everything, including the target task. The diagnosis is usually catastrophic forgetting: the fine-tune ran too many epochs at too high a learning rate, and the model overwrote its pretrained capabilities. Fixes: lower the learning rate (1e-5 to 5e-5 for full FT, higher for LoRA), train for fewer epochs (1 to 3), stop early on a held-out validation set, mix some general instruction data into your domain data (10–30% general), or just use LoRA, which leaves the base weights frozen by construction. The fact that this is a common interview red-flag question tells you how often it bites teams in production.
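
Two of these fixes, mixing in general data and early stopping, are easy to sketch. The helper names, the 20% default, and the patience value below are hypothetical choices for illustration; the datasets are placeholder strings standing in for real examples.

```python
import random

def mix_datasets(domain, general, general_fraction=0.2, seed=0):
    """Combine all domain examples with enough general ones to reach the target share."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    n_general = round(len(domain) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general))  # cannot sample more than we have
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed

def should_stop(val_losses, patience=2):
    """True once validation loss has not improved for `patience` evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Placeholder data: 800 domain examples, 5000 general instruction examples.
domain = [f"domain-{i}" for i in range(800)]
general = [f"general-{i}" for i in range(5000)]
mixed = mix_datasets(domain, general, general_fraction=0.2)
n_general = sum(1 for ex in mixed if ex.startswith("general-"))
print(len(mixed), n_general, should_stop([1.0, 0.8, 0.85, 0.9]))
```

Here 800 domain examples are joined by 200 general ones, so general data makes up 20% of the mix, and the validation curve that bottomed out at 0.8 triggers a stop after two evaluations without improvement.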

When this is asked in interviews#

This is the most common mid-to-senior question on AI-product and ML-engineering loops, because it’s where most teams have to make real decisions. The interviewer wants to see that you can reason about adaptation as a cost-benefit problem, not just name techniques.

What they’re checking:

  1. Do you know the post-training stack (SFT → preference optimization → safety), and can you place RLHF / DPO / LoRA on it?
  2. Can you pick the right adaptation level for a given problem: prompt engineering vs LoRA vs full fine-tune vs continued pretraining?
  3. Do you understand the failure modes: alignment tax, catastrophic forgetting, reward hacking, distribution shift between fine-tune data and production traffic?

Common follow-ups:

  • “When does fine-tuning beat a longer prompt?” — when latency budget rules out long contexts, when you have >1000 consistent examples, when the task has stable structure that survives many turns.
  • “Why is DPO so much more popular than RLHF in open-weights work?” — simpler ops, no separate reward model, more reproducible. Frontier labs still use RLHF or hybrid stacks because they have the infrastructure for it.
  • “How do you evaluate a fine-tune?” — held-out set for the target task, plus a regression suite (MMLU, HumanEval, a small instruction-following set) to catch capability erosion. The regression suite is where teams get caught.
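
The regression-suite idea from the last follow-up can be sketched as a simple gate that compares the fine-tune against the base on each benchmark. The function name, the 0.02 tolerance, and every score below are made up for illustration; real scores would come from your own evaluation harness.

```python
def regression_report(base_scores, tuned_scores, max_drop=0.02):
    """Return the benchmarks where the fine-tune fell more than max_drop below base."""
    regressions = {}
    for name, base in base_scores.items():
        tuned = tuned_scores.get(name)
        if tuned is None:
            raise KeyError(f"missing fine-tuned score for {name}")
        if base - tuned > max_drop:
            regressions[name] = round(base - tuned, 4)
    return regressions

# Hypothetical accuracy numbers, not measurements of any real model.
base = {"mmlu": 0.70, "humaneval": 0.50, "instruction_following": 0.80}
tuned = {"mmlu": 0.69, "humaneval": 0.41, "instruction_following": 0.80}
report = regression_report(base, tuned)
print(report or "no regressions beyond tolerance")
```

In this made-up run the fine-tune holds steady on two benchmarks but drops nine points on the coding one, which is the kind of capability erosion a target-task score alone would not reveal.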