Pretraining Paradigms

Causal vs masked vs contrastive vs span-corruption. The objective you pick determines what the model is good at.

Concept Intermediate
7 min read
pretraining objectives self-supervised transfer-learning

Summary

Pretraining is the long, expensive phase where a foundation model learns a generic representation of its data — and the objective you pick during pretraining decides what kind of model you end up with. Causal (next-token) pretraining gives you a generator that’s strong at completion and instruction-following. Masked-language modeling gives you a bidirectional encoder that’s strong at classification, retrieval, and ranking. Contrastive pretraining gives you an embedding space where similar things land close together. Span-corruption gives you an encoder-decoder that’s strong at structured rewriting.

Each objective is self-supervised: the labels come from the input itself (the next token, the masked word, the matching image), so you can pretrain on the open web without human annotation. The choice of objective is the most consequential architectural decision after picking the transformer itself — it sets the ceiling on what the adapted model can do.

Why it matters

Engineers reach for foundation models constantly, and the right model for a task is often determined by which objective it was pretrained on. A causal LM (GPT, Llama, Claude, Mistral) is the right pick for free-form generation. A masked LM (BERT, RoBERTa, DeBERTa) is the right pick for sentence classification or retrieval encoders. A contrastive model (CLIP, SigLIP) is the right pick for cross-modal search. Using the wrong objective for the task means either worse quality, more expensive fine-tuning, or both.

This is also the lens that lets you read the literature. When a new model claims state-of-the-art on a benchmark, the first thing to check is its pretraining objective and corpus — that explains most of what you’re seeing before you look at architectural details.

How it works

Causal language modeling (CLM): predict the next token

The objective: given tokens x_1, x_2, ..., x_{t-1}, predict x_t. Loss is the cross-entropy of the predicted distribution against the actual next token. Architecturally, this requires causal masking in the attention layers: each position can attend only to itself and earlier positions, never to the future, so the prediction for x_t depends only on x_1..x_{t-1}.
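
As a minimal sketch of the masking rule and loss described here, the pure-Python snippet below builds a causal mask and scores next-token predictions. It assumes the common convention that each position sees itself and everything before it, and that the output at position i is scored against token i+1. The names `causal_mask` and `next_token_loss` are illustrative and not taken from any library.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def next_token_loss(logits, token_ids):
    """Mean cross-entropy, scoring the output at position i against token i+1.

    logits: one list of scores per position, one score per vocabulary entry.
    token_ids: the input token ids, same length as logits.
    """
    total = 0.0
    # The last position has no following token to predict, so it is skipped.
    for i in range(len(token_ids) - 1):
        row = logits[i]
        peak = max(row)  # subtract the max before exp for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[token_ids[i + 1]]
    return total / (len(token_ids) - 1)
```

With uniform logits over a vocabulary of V entries the loss comes out to log V, which is a handy sanity check for an untrained model.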

CLM is the backbone of the generative-AI era. GPT-3, GPT-4, Llama-3, Claude-3, Mistral, Gemini, and DeepSeek-V3 are all causal LMs at heart. The objective is generative: once trained, you sample from the model autoregressively to produce text. Because the chain rule factorizes p(x) into the product of p(x_t | x_1, ..., x_{t-1}) over all positions, minimizing next-token cross-entropy fits a model of the full sequence distribution p(x), and doing it well requires using everything that came before each token.

Masked language modeling (MLM): fill in the blanks

The objective: randomly mask ~15% of tokens in the input, ask the model to reconstruct them. The attention is bidirectional — every position can attend to every other position, including the future. BERT (2018) introduced this; RoBERTa, DeBERTa, and ModernBERT refined it.
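
The masking step can be sketched in a few lines of standard-library Python. This is a simplification under stated assumptions: every selected token is replaced by the mask id, whereas BERT's original recipe sometimes substitutes a random token or leaves the token unchanged. `MASK_ID` is a hypothetical id, and the label value -100 mirrors the default `ignore_index` of PyTorch's cross-entropy loss, used here only as a marker for "do not score this position".

```python
import random

MASK_ID = 0     # hypothetical id of the [MASK] token in the vocabulary
IGNORE = -100   # label meaning "this position is not scored"


def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for one masked-language-modeling example."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the token from the model
            labels.append(tok)       # and ask the model to reconstruct it
        else:
            inputs.append(tok)       # visible context, attended bidirectionally
            labels.append(IGNORE)    # no loss on unmasked positions
    return inputs, labels
```

Only the masked positions contribute to the loss, which is why each pass over a sequence yields a training signal from roughly 15% of its tokens.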

MLM produces a model that excels at understanding but not at generation. You can’t easily sample from a masked LM because the objective never defines a left-to-right factorization of the sequence: every prediction conditions on context from both sides. What you get is a strong encoder: for classification, retrieval, named-entity recognition, and sentence similarity, an MLM-pretrained encoder is often still the best foundation, even in 2026.

Span-corruption (T5-style): mask and regenerate spans

The objective: pick contiguous spans of the input, replace each with a sentinel token, and train an encoder-decoder to regenerate the missing spans concatenated together. T5 introduced this; UL2, Flan-T5, and several code models use variants.
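
To make the input/target format concrete, here is a small sketch that turns a token list plus a set of chosen spans into an encoder input and a decoder target. The span positions are passed in by hand rather than sampled, the function name `corrupt_spans` is illustrative, and the `<extra_id_N>` sentinel naming is borrowed from common T5 tokenizers as an assumption.

```python
def corrupt_spans(tokens, spans):
    """Build (encoder_input, decoder_target) for span corruption.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    encoder_input, decoder_target = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        encoder_input.extend(tokens[cursor:start])  # keep text before the span
        encoder_input.append(sentinel)              # span collapses to a sentinel
        decoder_target.append(sentinel)             # target names the sentinel...
        decoder_target.extend(tokens[start:end])    # ...then spells out the span
        cursor = end
    encoder_input.extend(tokens[cursor:])           # keep text after the last span
    decoder_target.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return encoder_input, decoder_target


tokens = "Thank you for inviting me to your party last week .".split()
enc, dec = corrupt_spans(tokens, [(2, 4), (8, 9)])
# enc: Thank you <extra_id_0> me to your party <extra_id_1> week .
# dec: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The target is much shorter than the input, which is one reason this objective is cheaper per example on the decoder side than regenerating the full sequence.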

Span-corruption gives you a unified text-to-text interface: classification becomes “input → label-as-text”, translation becomes “input → translated-text”, and so on. The encoder reads, the decoder writes. The trade-off is architectural complexity (encoder-decoder is heavier than decoder-only) and ecosystem fragmentation — most adaptation tooling assumes decoder-only.

Contrastive pretraining: pull matching pairs together

The objective: given pairs of related inputs, such as (image, caption), train two encoders such that the embeddings of true pairs land close together and mismatched pairs land far apart in a shared embedding space. The loss is typically InfoNCE: for each anchor in a batch, the positive is its true pair and the negatives are all other items in the batch.
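
A one-directional in-batch version of this loss can be sketched in plain Python. The sketch assumes the embeddings are already L2-normalized so the dot product is a cosine similarity, and the temperature of 0.07 is only an illustrative default. CLIP-style training averages this loss over both directions (image to text and text to image); only one direction is shown, and `info_nce` is an illustrative name.

```python
import math


def info_nce(anchors, positives, temperature=0.07):
    """Mean in-batch contrastive loss.

    anchors[i] is paired with positives[i]; every other row of `positives`
    serves as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every candidate in the batch.
        sims = [
            sum(a * p for a, p in zip(anchor, candidate)) / temperature
            for candidate in positives
        ]
        peak = max(sims)  # subtract the max before exp for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in sims))
        total += log_norm - sims[i]  # cross-entropy with the true pair as label
    return total / len(anchors)
```

When every candidate looks the same to the anchor, the loss equals log of the batch size; it approaches zero as the true pair's similarity dominates the negatives.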

CLIP and ALIGN (image, text) use this, SigLIP uses a close variant that replaces the softmax with a pairwise sigmoid loss, and most modern retrieval embedding models (E5, BGE, Cohere-embed) are trained contrastively as well. The result is not a generator but a similarity engine: you can ask “find the image most similar to this caption” without ever generating either.

Denoising-diffusion pretraining: predict the noise

For image, video, and audio generators, the objective is different again: take a clean sample, add a known amount of Gaussian noise, and train the model to predict the noise that was added. Repeated denoising at inference produces a sample from the data distribution.
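
The training side of this objective can be sketched without any model at all. The snippet below assumes a DDPM-style variance-preserving parameterization, where `alpha_bar` (a name introduced here) is the fraction of signal variance kept at the chosen noise level; the denoising network itself is left out, so `predicted_eps` stands in for its output.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Return (x_t, eps) with x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # the noise the model must recover
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps


def noise_prediction_loss(predicted_eps, eps):
    """Mean squared error between the predicted and the true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, eps)) / len(eps)
```

In a training loop, each step would draw a noise level, call `add_noise` on a clean sample, run the network on `x_t`, and minimize `noise_prediction_loss` against the returned `eps`.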

Stable Diffusion, DALL·E 2, Sora, and Imagen all use diffusion pretraining. It’s the perceptual-domain analogue of CLM: both are generative objectives grounded in data likelihood (the diffusion loss is a reweighted variational bound on it), but diffusion’s iterative refinement matches the continuous, high-dimensional nature of pixels better than autoregressive next-pixel prediction would.

Variants and trade-offs

Causal LM (decoder-only) — best generative quality, simplest serving, supports streaming output. Bidirectional understanding is weaker. Examples: GPT-4, Claude, Llama, Mistral, Gemini.
Masked LM (encoder-only) — best bidirectional understanding for fixed-length inputs (classification, retrieval, ranking). Can’t generate text autoregressively. Examples: BERT, RoBERTa, DeBERTa, ModernBERT.

Other axes:

  • Decoder-only vs encoder-decoder. Decoder-only (GPT family) is simpler to scale and serve, and at sufficient scale it matches encoder-decoder quality on most tasks. Encoder-decoder (T5, Flan-T5, BART) wins on input-conditioning-heavy tasks like translation and summarisation, but the ecosystem has consolidated around decoder-only for new frontier work.
  • Single-objective vs mixture-of-denoisers (UL2-style). Some recent models mix CLM, MLM, and span-corruption in one pretraining run, sampling each objective with some probability. The result is a model that’s competent at both generation and understanding, at the cost of training-time complexity.
  • Data mixture matters more than objective at frontier scale. Llama-3, Qwen-2.5, and DeepSeek-V3 all use CLM — the differentiation between them is overwhelmingly in what they trained on (code share, math share, multilingual share, synthetic data, quality filtering). The objective is the engine; the data is the fuel.
  • Continued pretraining. Instead of pretraining from scratch, start from an open-weights checkpoint and continue pretraining on a domain-specific corpus (medical, legal, code in a niche language). Cheaper than from scratch, more powerful than fine-tuning on the same data.
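
The mixture-of-denoisers idea from the list above comes down to drawing an objective for each training example. A minimal sketch, with mixing weights that are purely hypothetical (the text does not give any, and real recipes tune them):

```python
import random

# Hypothetical mixing weights, chosen only for illustration.
OBJECTIVE_WEIGHTS = {"causal": 0.5, "masked": 0.25, "span_corruption": 0.25}


def sample_objective(rng):
    """Pick the pretraining objective to apply to the next example."""
    names = list(OBJECTIVE_WEIGHTS)
    weights = [OBJECTIVE_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

The data pipeline would then format the example for whichever objective was drawn, which is where the extra training-time complexity comes from.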

Why BERT lost to GPT, then partly came back

In 2019–2020, BERT and its variants dominated NLP benchmarks; everyone fine-tuned BERT for classification, retrieval, and NER. GPT-3 changed the conversation by showing that a sufficiently large causal LM could do classification via prompting, no fine-tuning required. The field largely abandoned MLM for new frontier work because the generative interface is so much more flexible. But MLM didn’t die: MLM-pretrained encoders remained the usual starting point for embedding models (retrieval, search, RAG), typically with a contrastive stage on top, and modern releases like ModernBERT (2024) show the encoder-only line is still alive for the workloads it’s best at. The takeaway: the right objective depends on the workload, not on what’s trending.

When this is asked in interviews

This is a common mid-loop question on ML and AI-engineering loops, particularly when the team has multiple model types in production (a retrieval encoder and a generative model). The interviewer is checking that you can map a task to an objective rather than reaching for the same model for everything.

What they’re checking:

  1. Can you name the four main objectives and roughly when each wins?
  2. Do you understand why causal masking is needed for CLM, and why MLM doesn’t need it?
  3. Can you reason about a hybrid system, for example a RAG pipeline that uses an MLM-pretrained encoder for retrieval and a CLM-pretrained model for generation?

Common follow-ups:

  • “Why can’t you just sample from BERT?” MLM doesn’t define a left-to-right factorization; it conditions on the full context including future tokens, which you don’t have when generating.
  • “When would you pick an encoder-decoder over decoder-only?” Translation-style tasks where the input is long and structured and the output transforms it; less common in 2026 because frontier decoder-only models match quality.
  • “What does CLIP’s contrastive loss actually optimize?” InfoNCE, a cross-entropy over a softmax of similarity scores in which each anchor’s true pair is the correct class and the other in-batch items are the negatives. Minimizing it maximizes a lower bound on the mutual information between the two encoders’ outputs.