What Are Foundation Models? — Gen AI

Summary

A foundation model is a single, broadly-pretrained model that you adapt to many downstream tasks instead of training a fresh model per task. The defining properties are scale (billions of parameters, trillions of training tokens), a generic self-supervised objective (next-token prediction, masked reconstruction, denoising), and transfer — the same weights serve summarization, code completion, classification, retrieval, and dozens of other tasks with little or no extra training.

The term was popularised in 2021 by the Stanford CRFM report, but the engineering pattern was already in flight: GPT-3 (2020) had shown that one 175B-parameter language model, prompted with a few examples and no gradient updates, could be competitive with fine-tuned task-specific models on many benchmarks and ahead of them on some. Today’s GPT-4-class, Claude-3-class, Llama-3-class, and Gemini-class models are the same idea industrialised: one expensive pretraining run, many cheap adaptations on top.

Why it matters

Before foundation models, shipping an ML feature meant a per-task pipeline: collect labels, choose an architecture, train, tune, deploy, monitor. Cost was dominated by labelled data. After foundation models, you start from a pretrained checkpoint that already encodes patterns from a very large corpus of text (or images, or code) and pay only for adaptation: sometimes just a prompt, sometimes a small fine-tune.

That changes the economics for every team that uses ML. A two-engineer team can now ship features that previously might have needed a 10-person ML org and a labelling pipeline. The trade-off moves from “can we collect enough labels?” to “can we afford the inference cost?” and “can we evaluate the output?”. The bottleneck has shifted from data collection to evaluation harnesses, prompt engineering, and serving infrastructure, which is why those skills are now load-bearing on most product teams.

How it works

One model, many tasks via the prompt-or-adapt interface

The foundation-model pattern has two phases. Pretraining: a single huge run on a generic corpus with a self-supervised loss. Adaptation: a cheap step per downstream task. Adaptation can be as light as writing a prompt, or as heavy as a full fine-tune on a labelled set — but the per-task cost is orders of magnitude smaller than training from scratch.

The interface that makes this work is the prompt: a natural-language description of the task, often with a few examples (few-shot). Because the pretrained model has seen enough text to recognise task instructions (and, in practice, is usually instruction-tuned afterwards), “translate the following to French” or “extract the JSON” routes the same weights through different behaviours. Think of it as a substrate-and-skin pattern: one expensive substrate, many cheap skins.
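To make the prompt interface concrete, here is a minimal sketch of how two unrelated tasks reduce to two strings aimed at the same model. The `Task` class, the `build_prompt` helper, and the `Input:`/`Output:` layout are illustrative assumptions, not any vendor's required format, and the call to an actual model is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A downstream task expressed purely as prompt text (hypothetical helper)."""
    instruction: str
    examples: list[tuple[str, str]] = field(default_factory=list)


def build_prompt(task: Task, user_input: str) -> str:
    """Render the instruction, any few-shot examples, and the new input."""
    parts = [task.instruction.strip(), ""]
    for source, target in task.examples:
        parts += [f"Input: {source}", f"Output: {target}", ""]
    parts += [f"Input: {user_input}", "Output:"]
    return "\n".join(parts)


# Few-shot task: an instruction plus two worked examples.
translate = Task(
    instruction="Translate the following to French.",
    examples=[("Good morning", "Bonjour"), ("Thank you", "Merci")],
)
# Zero-shot task: an instruction only.
extract = Task(instruction="Extract the person's name and age as JSON.")

# Both prompts would be sent to the same pretrained weights; only the text differs.
translate_prompt = build_prompt(translate, "See you tomorrow")
extract_prompt = build_prompt(extract, "Ana is 31 years old.")
print(translate_prompt)
print(extract_prompt)
```

Adding a new task here means writing a new `Task`, not training a new model, which is the per-task cost difference the section describes.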

What pretraining actually produces

A pretrained foundation model is a learned approximation of the distribution p(x) of its training data. For an autoregressive language model, p(x) factorises into next-token conditionals, p(x) = p(x_1) · p(x_2 | x_1) · … · p(x_T | x_1, …, x_{T-1}), and every token of the corpus contributes gradient signal pointing the parameters toward a configuration that assigns high probability to the kind of text humans write. The side effects of fitting this distribution include:

World knowledge — facts about geography, history, science, code APIs.
Linguistic competence — grammar, idiom, register, multilingual transfer.
Latent skills — arithmetic, logical inference, planning, simple coding. These are not trained explicitly; they emerge because predicting the next token of a math textbook requires the model to do math implicitly.

This is why a single model can do tasks it was never explicitly trained on: predicting code requires the model to understand code; predicting an argument’s conclusion requires it to follow the reasoning.
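The next-token objective behind this can be illustrated with a deliberately tiny stand-in. The sketch below fits a bigram model by counting (an assumption made for brevity; real foundation models are transformers trained by gradient descent on far longer contexts) and then scores text by its average negative log-likelihood, the same quantity a pretraining run minimises. The toy corpus and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each token follows each context token.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1


def next_token_prob(prev: str, nxt: str) -> float:
    """Maximum-likelihood estimate of p(next | prev) from the counts."""
    total = sum(follow_counts[prev].values())
    return follow_counts[prev][nxt] / total if total else 0.0


def avg_negative_log_likelihood(tokens: list[str]) -> float:
    """Average next-token loss: -(1/N) * sum of log p(x_t | x_{t-1})."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_prob(prev, nxt)
        losses.append(-math.log(p) if p > 0 else float("inf"))
    return sum(losses) / len(losses)


seen = "the cat sat on the mat .".split()
shuffled = "mat the on sat cat the .".split()
print(avg_negative_log_likelihood(seen))      # finite: every transition was observed
print(avg_negative_log_likelihood(shuffled))  # inf: contains transitions never observed
```

Even at this scale the model prefers word orders that resemble its corpus. Scaling the same objective up to a large transformer and a very large corpus is what produces the broader side effects listed above.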

The architectural commitment

Foundation models are almost exclusively transformers: decoder-only for language (GPT, Llama, Mistral; proprietary models such as Claude and Gemini are generally understood to follow the same design, though not all of their details are published), encoder-decoder for translation-flavoured tasks (T5, Flan-T5), and a transformer + diffusion stack for images and video. The architecture matters less than the recipe at this point: most openly documented recent models have converged on rotary positional embeddings, grouped-query attention, SwiGLU activations, and RMSNorm, all small refinements on a stable base.

The real differentiation is in data curation, training mixture, post-training (instruction-tuning + preference optimisation), and infrastructure. Two labs running the same architecture at the same scale on different data produce noticeably different models.
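Two of the refinements named above, RMSNorm and the SwiGLU feed-forward block, are small enough to write out. This is a plain-Python sketch on lists with toy sizes and hand-picked weights (all assumptions for illustration); production implementations use tensor libraries and learned parameters, and they differ in details such as bias terms and epsilon placement.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root-mean-square, then scale by a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block: down( silu(gate(x)) * up(x) )."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy dimensions (model width 2, hidden width 3) and arbitrary weights.
x = [3.0, 4.0]
normed = rms_norm(x, gain=[1.0, 1.0])
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # projects 2 -> 3
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]    # projects 2 -> 3
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]     # projects 3 -> 2
out = swiglu_ffn(normed, w_gate, w_up, w_down)
print(normed, out)
```

Neither block changes what a transformer is. Both are drop-in replacements for LayerNorm and the classic two-matrix feed-forward layer, which fits the point that the base design is stable.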

Variants and trade-offs

Frontier proprietary models (GPT-4-class, Claude-3-class, Gemini-Ultra): strongest reasoning, instruction-following, and multimodal capability. Behind APIs, expensive per token, opaque about training data. You pay for capability and for not having to run the infrastructure yourself.

Open-weights models (Llama, Mistral, Qwen, DeepSeek, Gemma): weights downloadable, hostable, fine-tunable. As a rough rule of thumb they trail the frontier by 6–18 months on the hardest benchmarks, though the gap varies by task and release, and they are close on most production tasks. You pay in GPUs and platform engineering.

Other practical axes:

General vs. domain-specialised. A frontier generalist will outperform most domain-specific models on most benchmarks, but a code-specialised model (DeepSeek-Coder, Codestral) or a medical-specialised model can beat it on the narrow task while being 10x smaller. Speciality wins when the domain has its own conventions that aren’t well-represented in generic web data.
Dense vs. mixture-of-experts (MoE). A dense model (Llama-3) activates every parameter on every token. An MoE model (Mixtral, DeepSeek-V3, and, according to unconfirmed reports, GPT-4) routes each token to a few experts out of many, decoupling total parameter count from per-token compute. MoE wins on capability-per-FLOP; dense wins on memory footprint and serving simplicity, because an MoE model must still keep every expert in memory.
Text-only vs. multimodal. Text-only models are cheaper and faster. Multimodal models (Gemini, GPT-4o, Claude-3) accept, and in some cases produce, images, audio, or video; the supported modalities differ by model (Claude-3, for example, takes image input but returns text). Multimodality is rapidly becoming the default at the frontier.
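The dense vs. mixture-of-experts trade-off in the list above comes down to how many parameters each token touches. The following sketch shows top-k routing with toy "experts" that just scale their input; the router scores, the expert count, and the per-expert parameter figure are all hypothetical, and real routers are learned layers with extra machinery such as load balancing.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(x, experts, router_scores, k=2):
    """Return the weighted sum of the selected experts' outputs and their indices."""
    weights = route_top_k(router_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items()), sorted(weights)


# Eight toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_scores = [0.1, 2.0, 0.3, 0.2, 1.5, 0.0, 0.4, 0.1]  # hypothetical router output
y, active = moe_layer(1.0, experts, router_scores, k=2)

params_per_expert = 1_000_000                          # hypothetical size
total_params = len(experts) * params_per_expert        # must all sit in memory
active_params = len(active) * params_per_expert        # used for this token
print(active, total_params, active_params)
```

Per-token compute follows `active_params`, while the memory needed to serve the model follows `total_params`. That gap is why MoE looks better per FLOP and dense looks better on footprint.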

The 'foundation model' name was controversial

When the Stanford CRFM report introduced the term in 2021, some researchers pushed back — arguing it overstated the depth of these models’ understanding and risked enshrining a vendor pattern that hadn’t yet proven safe at scale. The criticism didn’t change the engineering reality (the pattern works and the economics favour it), but it shaped the alignment and safety subfield: red-teaming, capability evaluations, and refusal training all matured partly because the foundation-model framing forced the question “what is this base layer responsible for?”

When this is asked in interviews

This is a baseline question on AI-product, ML-engineering, and increasingly platform-engineering loops. The interviewer wants to see that you can talk about foundation models as an engineering pattern, not a marketing term.

Three things they’re checking:

Do you understand the pretrain-then-adapt loop, and can you place a real product on it (which model, which adaptation, which evaluation)?
Do you understand the cost shape: that pretraining is a one-time fixed cost amortised across all users, while inference is a per-call variable cost that scales almost linearly with usage?
Can you reason about build-vs-buy: when a hosted API wins, when an open-weights model plus fine-tune wins, and when training from scratch ever makes sense (almost never, but you should know the exceptions: highly proprietary data, regulatory isolation, frontier research)?
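The cost shape and the build-vs-buy question in the checklist above are both simple arithmetic, sketched below. Every figure is a made-up placeholder rather than real vendor pricing, and the model ignores engineering time, utilisation, and quality differences, so treat it as a way to structure the argument rather than a pricing tool.

```python
def cost_per_call(fixed_cost, variable_cost_per_call, total_calls):
    """Fully loaded cost of one call: amortised fixed cost plus variable cost."""
    return fixed_cost / total_calls + variable_cost_per_call


def breakeven_calls_per_month(self_host_fixed_per_month,
                              api_cost_per_call,
                              self_host_cost_per_call):
    """Monthly volume above which self-hosting is cheaper than a hosted API."""
    saving = api_cost_per_call - self_host_cost_per_call
    if saving <= 0:
        return float("inf")  # the API is never more expensive per call
    return self_host_fixed_per_month / saving


# Hypothetical numbers throughout.
PRETRAINING_COST = 10_000_000.0    # one-time, paid by the model provider
INFERENCE_COST_PER_CALL = 0.002    # paid on every call

# The amortised share of pretraining shrinks with volume; the variable part does not.
for calls in (1_000_000, 100_000_000, 10_000_000_000):
    print(calls, round(cost_per_call(PRETRAINING_COST, INFERENCE_COST_PER_CALL, calls), 4))

# Build-vs-buy: fixed GPU and platform cost vs. a per-call API premium.
threshold = breakeven_calls_per_month(
    self_host_fixed_per_month=20_000.0,
    api_cost_per_call=0.002,
    self_host_cost_per_call=0.0005,
)
print(round(threshold))
```

Below the break-even volume the hosted API is cheaper in this model, and above it the open-weights route starts to pay for its fixed costs. In an interview the useful part is naming which inputs would move that threshold.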

Common follow-ups:

“What’s the difference between a foundation model and a large language model?” LLM is a model class; foundation model is a role. Vision and audio foundation models exist too.
“Why can’t I just train my own?” Pretraining a frontier-class model means, roughly, more than $10M of compute (public estimates for the largest runs are far higher), a team on the order of 50 researchers, and about a year. The economics demand reuse.
“When does fine-tuning beat prompting?” As a rule of thumb: when you have on the order of 1,000 or more high-quality labels, when the latency budget rules out long prompts, and when the task has a stable structure that survives across calls.