Generative Pretraining (GPT) — Gen AI

Origin and intuition

In June 2018, a few months before BERT, Radford, Narasimhan, Salimans, and Sutskever at OpenAI published “Improving Language Understanding by Generative Pre-Training”. The architecture was a 12-layer transformer decoder (causal attention only) pretrained on next-token prediction over the BookCorpus dataset (~800M words). The downstream protocol followed ULMFiT rather than ELMo’s frozen-feature approach: fine-tune the whole model on the target task with a small added head.

GPT-1 was overshadowed within months. BERT came out in October 2018 and beat it on most benchmarks. For a year it looked like bidirectional encoders were the answer and causal decoders were a stepping stone. Then GPT-2 (Feb 2019, 1.5B parameters) showed that scale changed the equation: the same causal-LM objective, trained on much more data, produced a model that could generate coherent multi-paragraph text and exhibited surprising zero-shot transfer — give it a task description in the prompt and it would attempt the task without any fine-tuning at all. GPT-3 (May 2020, 175B parameters) made this the dominant frame: “few-shot learning via in-context examples”, a single model handling translation, summarization, QA, coding, arithmetic, all by prompting alone.

The architectural insight is simple in retrospect. Next-token prediction over a sufficiently diverse corpus is a universal task. To predict the next token of “The capital of France is” you must learn geography; to predict the next token in a Python file you must learn programming; to predict the next token in a translated sentence pair you must learn translation. As the corpus expands to span the internet, the objective compresses an enormous range of skills into a single set of weights. Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) showed that loss decreases as a smooth power law in model size, data size, and compute — keep scaling, keep improving, and surprising capabilities emerge at thresholds.

By 2026, every frontier “general-purpose” foundation model — GPT-4, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek — is a decoder-only causal-LM transformer descended directly from the GPT-1 paper’s architecture. The pretraining objective hasn’t fundamentally changed since 2018. The model sizes, dataset sizes, and refinement layers (RLHF, DPO, RLAIF, constitutional methods) have.

Inputs and outputs

GPT consumes a sequence of token IDs and produces, at every position, a probability distribution over the next token. Concretely:

Input: x_1, x_2, ..., x_n — a sequence of token IDs from a fixed vocabulary (typically 30K-200K BPE/SentencePiece tokens).
Output: at position t, logits z_t ∈ R^V over the vocabulary; softmax gives P(x_{t+1} | x_1, ..., x_t).

During training, the model is run once over the whole sequence with causal masking, and cross-entropy loss is computed at every position. During inference (generation), the model is run autoregressively: produce x_{t+1}, append, run again to produce x_{t+2}, and so on. Two phases at serving time:

Prefill. The prompt is processed in one forward pass. Compute-bound. Builds the initial KV cache.
Decode. Each new token requires one forward pass that reads the KV cache and appends one entry. Memory-bandwidth-bound; this is where the per-token cost of long generations lives.
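The autoregressive loop described above can be sketched in a few lines. This is a minimal illustration in plain Python, not a real model: `toy_logits` is a hypothetical stand-in for a trained network, `VOCAB_SIZE` is an arbitrary toy value, and decoding is greedy (argmax) rather than sampled.

```python
import math

VOCAB_SIZE = 8  # hypothetical toy vocabulary


def toy_logits(tokens):
    """Stand-in for a trained model: next-token logits for the sequence so far.

    The toy rule simply prefers (last token + 1) mod VOCAB_SIZE.
    """
    preferred = (tokens[-1] + 1) % VOCAB_SIZE
    return [5.0 if v == preferred else 0.0 for v in range(VOCAB_SIZE)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens):
    """Greedy autoregressive decoding: predict, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))  # P(x_{t+1} | x_1, ..., x_t)
        next_token = max(range(VOCAB_SIZE), key=probs.__getitem__)
        tokens.append(next_token)
    return tokens


print(generate([3, 4], max_new_tokens=4))  # [3, 4, 5, 6, 7, 0]
```

Note that this sketch re-reads the whole sequence on every step. A real serving stack avoids that by keeping the keys and values from earlier positions in the KV cache, which is exactly the prefill/decode split above.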

The same architecture handles essentially any task by reframing it as text-in/text-out: code completion (input is partial code, output is continuation), translation (input is "Translate to French: <sentence>", output is the translation), classification (input is the document plus a label prompt, output is the label token). This is what “general-purpose” means in the GPT lineage — no task-specific architecture or head, just prompts.

Architecture diagram

A GPT-style decoder-only transformer is a stack of identical blocks. Each block has a causal self-attention sublayer and a feed-forward network, both wrapped in residual connections and (since GPT-2; GPT-1 used post-norm) pre-norm:

Tokens: t_0  t_1  t_2  ...  t_n

   │
   ▼
 Token embedding + Position embedding (or RoPE applied inside attention)
   │
   ▼
 ┌─────────────────────────────────────────┐
 │  Decoder block 1                         │
 │   ┌─────────────────────────────────┐   │
 │   │ LayerNorm                        │   │
 │   │ Causal Multi-Head Self-Attention │   │
 │   │   (mask t_i from attending t_j>i)│   │
 │   └─────────────────────────────────┘   │
 │              + residual                  │
 │   ┌─────────────────────────────────┐   │
 │   │ LayerNorm                        │   │
 │   │ Feed-Forward (d → 4d → d)        │   │
 │   │   (SwiGLU / GeLU activation)     │   │
 │   └─────────────────────────────────┘   │
 │              + residual                  │
 └─────────────────────────────────────────┘
   │
   ▼
 (Decoder block 2)
   ⋮
 (Decoder block L)
   │
   ▼
 LayerNorm + Linear projection to vocab size
   │
   ▼
 Logits over vocabulary at every position
   │
   ▼
 (during training: cross-entropy vs next token at every position)
 (during inference: softmax + sample at the last position)

The causal mask is the entire structural difference from a BERT-style encoder. In attention:

Attention(Q, K, V) = softmax((Q Kᵀ + M) / √d_k) V

where M is a mask matrix with zeros on and below the diagonal and -∞ above. This makes position t unable to attend to positions > t, which lets you train on every position in parallel (each position only sees its causal past, which is fully determined) while still using the model autoregressively at inference.
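To make the mask concrete, here is a minimal single-head sketch in plain Python (lists of vectors, no tensor library). It is an illustration under simplifying assumptions: no learned projections, no batching, and the function name `causal_attention` is made up for this example. It scales the scores first and then masks, which gives the same result as the formula above because -∞ stays -∞ under scaling.

```python
import math


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head causal attention. Q, K, V hold one vector per position."""
    n, d_k = len(Q), len(Q[0])
    outputs, weights = [], []
    for i in range(n):
        # Scaled dot-product scores; future positions j > i get -inf (the mask M).
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            if j <= i
            else float("-inf")
            for j in range(n)
        ]
        w = softmax(scores)  # masked positions receive exactly zero weight
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        )
    return outputs, weights


Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(p, 3) for p in row])  # lower-triangular rows, each summing to 1
```

Because row i never sees positions after i, changing a later token cannot alter an earlier output. That property is what allows every position to serve as a training example in a single forward pass.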

Model scales across the GPT lineage:

GPT-1 (2018): 12 layers, d_model = 768, 12 heads, 117M params, 512 context.
GPT-2 (2019): 48 layers, d_model = 1600, 25 heads, 1.5B params, 1024 context.
GPT-3 (2020): 96 layers, d_model = 12288, 96 heads, 175B params, 2048 context.
GPT-4 (2023): architecture undisclosed; unconfirmed reports describe a mixture-of-experts with ~1.7T total params (~280B active per token); 8K-128K context.

Beyond the GPT lineage, Llama 3.1 405B, Claude Opus, Gemini Ultra, and DeepSeek-V3 all use the same fundamental structure with refinements (RoPE, GQA, MoE, longer context windows).
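The dense parameter counts listed above can be roughly recovered from the layer count and width alone. The sketch below uses the common approximation of 12·d² parameters per block (4·d² for attention, 8·d² for a d → 4d → d feed-forward). It ignores biases and norm gains and assumes the output projection shares weights with the token embedding. The vocabulary sizes (40,478 for GPT-1; 50,257 for GPT-2 and GPT-3) are not given in this article and are supplied here as assumptions.

```python
def approx_params(n_layers, d_model, vocab_size, n_ctx):
    """Rough decoder-only parameter count, ignoring biases and norm gains."""
    attention = 4 * d_model * d_model      # Q, K, V, and output projections
    feed_forward = 8 * d_model * d_model   # d -> 4d and 4d -> d
    blocks = n_layers * (attention + feed_forward)
    # Token embedding (tied with the output projection) + learned positions.
    embeddings = vocab_size * d_model + n_ctx * d_model
    return blocks + embeddings


# (layers, d_model, vocab size [assumed], context length)
configs = {
    "GPT-1": (12, 768, 40478, 512),
    "GPT-2": (48, 1600, 50257, 1024),
    "GPT-3": (96, 12288, 50257, 2048),
}

for name, cfg in configs.items():
    print(f"{name}: ~{approx_params(*cfg) / 1e9:.2f}B parameters")
```

The estimates land within a few percent of the published 117M, 1.5B, and 175B figures, which shows that almost all of the parameters sit in the attention and feed-forward matrices and grow as layers × d_model².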

Training objective

The objective is causal language modeling:

L = - Σ_t log P(x_{t+1} | x_1, ..., x_t)

Sum cross-entropy over every position in every sequence in every batch. This is the entire pretraining objective for GPT-1, GPT-2, GPT-3, GPT-4, and every frontier decoder-only model since. The differences post-2020 are in what comes after pretraining:

Pretraining — causal LM on trillions of tokens of web + books + code + curated data. Produces a “base model” that can complete text but doesn’t follow instructions well.
Supervised fine-tuning (SFT) — fine-tune on high-quality (prompt, ideal response) pairs to teach the model to follow instructions and respond in a useful format.
Preference fine-tuning — RLHF (PPO against a reward model trained on human comparisons), or DPO (direct preference optimization without an explicit reward model), or RLAIF (preferences from another AI), or constitutional methods. This is what produces the “assistant” behavior.
Optional task-specific fine-tuning — for coding, math, tool use, long-context retention. Usually with the same SFT/preference machinery.

The pretraining loss is what builds raw capability. Post-training elicits it and shapes the behavior. The split is so well-established that “base model” vs “instruct model” or “chat model” are now distinct product SKUs.
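For reference, the causal-LM objective defined at the top of this section fits in a few lines. This is a plain-Python sketch under stated assumptions: `logits_per_position` is a hypothetical list of per-position logit vectors (what a model's forward pass would return), and the loss is averaged over positions rather than summed, which is the usual reporting convention.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def causal_lm_loss(logits_per_position, tokens):
    """Mean next-token cross-entropy.

    logits_per_position[t] is produced after reading tokens[0..t]; its target
    is tokens[t + 1], so the final position has no target and is skipped.
    """
    n_targets = len(tokens) - 1
    total = 0.0
    for t in range(n_targets):
        log_probs = log_softmax(logits_per_position[t])
        total -= log_probs[tokens[t + 1]]
    return total / n_targets


tokens = [2, 0, 3, 1]
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in tokens]   # an untrained-looking model
print(causal_lm_loss(uniform, tokens))             # ln(4) for a uniform guess

# Logits that favor the correct next token give a lower loss.
confident = [[4.0 if v == tokens[t + 1] else 0.0 for v in range(4)]
             for t in range(len(tokens) - 1)]
print(causal_lm_loss(confident, tokens))
```

A uniform guess over V tokens always costs ln(V) per position, which is a useful sanity check for the loss at the very start of training.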

Scaling laws. Kaplan et al. (2020) showed that with sufficient data, loss decreases as a power law in compute, parameters, and data. Hoffmann et al. (2022, “Chinchilla”) refined this: at a fixed compute budget, the optimal recipe trains a smaller model on more data than was conventional, roughly 20 tokens per parameter rather than the ~2 used for GPT-3. The Chinchilla finding reshaped frontier model training. Modern models routinely go past the 20:1 ratio on purpose, because a smaller model trained for longer is cheaper to serve: Llama 3.1 405B saw roughly 15T tokens (close to 40 tokens per parameter), and the smaller Llama 3 models far more.
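The Chinchilla rule of thumb turns into simple arithmetic once you add the widely used approximation that training costs about 6·N·D FLOPs for N parameters and D tokens. Both that approximation and the helper name `compute_optimal` are assumptions of this sketch, not something the article specifies.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget using C ~= 6 * N * D and D ~= tokens_per_param * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget of roughly Chinchilla's own training run: 70B params on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal(budget)
print(f"{n_params / 1e9:.0f}B params, {n_tokens / 1e12:.1f}T tokens")

# For contrast, GPT-3 trained 175B params on about 300B tokens.
print(f"GPT-3 tokens per parameter: {300e9 / 175e9:.1f}")
```

Under these assumptions, parameters and tokens both grow with the square root of the compute budget, so a 100× larger budget buys a 10× larger model trained on 10× more data.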

The architectural refinements that compound across the modern GPT lineage:

Pre-norm transformer. LayerNorm before each sublayer instead of after. Standard since GPT-2. Required for deep stacks to train without exotic warmup schedules.
RMSNorm. Drop the mean-centering in LayerNorm, keep the scaling. Slightly faster, no quality loss. Standard in Llama-family.
Rotary positional embeddings (RoPE). Apply position as a rotation in query/key space. Extrapolates to sequences longer than seen at training (with tricks like NTK scaling, YaRN). Standard everywhere modern.
SwiGLU / GeGLU feed-forward. Gated linear unit variants. ~1% loss improvement at the same parameter count. Standard in modern open-weights models.
Grouped-query attention (GQA) / multi-query attention (MQA). Heads share key/value projections, reducing KV-cache size at modest quality cost. Llama 2 70B uses GQA with 8 KV heads. Critical for serving cost.
Mixture-of-experts (MoE). Each token routes to k of N expert FFNs. Decouples parameter count from per-token FLOPs. Used by Mixtral 8×7B, DeepSeek-V3 (671B total, 37B active), and reportedly GPT-4. Frontier-standard.
FlashAttention. Kernel-level rewrite of attention that fuses the softmax and avoids materializing the n × n matrix in HBM. ~3× speedup, dramatically less memory. Required for training and serving past ~8K context.
Long-context refinements. Sliding-window attention, attention sinks, ring attention, position interpolation. The transition from 4K to 1M+ context relied on a stack of these tricks.
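Two of the refinements above are small enough to sketch directly: RMSNorm and rotary embeddings. This is plain Python for readability. The interleaved pairing of dimensions and the base of 10000 are common choices but are assumptions here; some implementations pair dimension i with i + d/2 instead.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square, with no mean subtraction."""
    if gain is None:
        gain = [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def rope(x, position, base=10000.0):
    """Rotate pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.5, 0.4, -0.6, 0.9]

# Same relative offset (2) at different absolute positions: same score.
print(dot(rope(q, 5), rope(k, 3)))
print(dot(rope(q, 12), rope(k, 10)))
```

The two printed scores agree up to floating-point error. That is the point of RoPE: the query-key dot product depends only on the distance between positions, not on where the pair sits in the sequence.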

For contrast, the two attention patterns side by side:

Causal (GPT-style): each position attends only to itself and earlier positions. Enables autoregressive generation. Trains on every position simultaneously via teacher forcing + causal mask. Native fit for text completion, dialogue, code generation.

Bidirectional (BERT-style): each position attends to every position. Enables rich contextual embeddings. Cannot generate text autoregressively without retrofitting. Native fit for classification, retrieval, ranking, sequence labeling.

Practical considerations

KV cache is the hidden cost driver. At inference, each generated token requires reading the keys and values of every previous token in every layer. The cache grows linearly with sequence length and is the single largest memory consumer at serving time. For a 70B-class model at 8K context, the per-request cache is multiple GB. GQA, MQA, and paged-attention serving (vLLM-style) are all aimed here.
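A back-of-the-envelope calculation shows where the “multiple GB” comes from. The shape below (80 layers, head dimension 128, 8 KV heads under GQA versus 64 under full multi-head attention) matches Llama 2 70B as publicly documented; the fp16 storage and the helper name `kv_cache_bytes` are assumptions of this sketch.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Per-request KV cache size: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2 ** 30

# 70B-class shape at 8K context, fp16 (2 bytes per value).
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192)

print(f"GQA (8 KV heads):  {gqa / GIB:.1f} GiB per request")   # 2.5 GiB
print(f"MHA (64 KV heads): {mha / GIB:.1f} GiB per request")   # 20.0 GiB
```

The cache scales linearly in every factor, so sharing KV heads eight ways cuts it eightfold, and doubling the context doubles it. Multiply by the number of concurrent requests and it is clear why cache layout dominates serving design.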

Throughput shape. Prefill is compute-bound; decode is memory-bandwidth-bound. The two phases have completely different optimal serving strategies. Modern serving stacks (vLLM, TensorRT-LLM, SGLang) batch them separately: continuous batching during decode (new requests join the in-flight decode batch token-by-token), prefill prioritization for short-prompt requests, prefix caching for shared system prompts. Achieving good utilization on both phases is most of the engineering effort in modern LLM serving.

Speculative decoding. Use a small “draft” model to propose k tokens; have the big model verify them in parallel. If the big model agrees with the first j draft tokens, you get j+1 tokens for one big-model forward pass plus the cheap draft passes. ~2-3× faster decoding in practice, depending on how often the draft is accepted. Standard at frontier serving.
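The accept/verify logic can be sketched with toy models. This is the greedy variant only: a draft token is accepted when it equals the target model's argmax. Production systems that sample use a rejection-sampling rule instead, so that the output distribution matches the target model. The names `target_model` and `draft_model` are hypothetical stand-ins, and the verification loop stands in for what would be a single batched forward pass.

```python
V = 8  # hypothetical toy vocabulary size


def target_model(tokens):
    """Toy 'big' model: always continues with (last + 1) mod V."""
    nxt = (tokens[-1] + 1) % V
    return [1.0 if v == nxt else 0.0 for v in range(V)]


def draft_model(tokens):
    """Toy 'small' model: agrees with the target except after token 5."""
    nxt = 0 if tokens[-1] == 5 else (tokens[-1] + 1) % V
    return [1.0 if v == nxt else 0.0 for v in range(V)]


def greedy_next(model, tokens):
    logits = model(tokens)
    return max(range(len(logits)), key=logits.__getitem__)


def speculative_step(target, draft, tokens, k):
    """One round: draft proposes k tokens, target verifies them."""
    proposal = []
    for _ in range(k):  # cheap autoregressive drafting
        proposal.append(greedy_next(draft, tokens + proposal))
    accepted = []
    for i in range(k):  # one batched target pass in a real system
        target_token = greedy_next(target, tokens + proposal[:i])
        accepted.append(target_token)
        if target_token != proposal[i]:
            return accepted  # j agreed tokens + the target's correction
    # Every draft token matched: the same pass also yields a bonus token.
    accepted.append(greedy_next(target, tokens + proposal))
    return accepted


def speculative_generate(target, draft, prompt, n_new, k=3):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        tokens += speculative_step(target, draft, tokens, k)
    return tokens[: len(prompt) + n_new]


print(speculative_step(target_model, draft_model, [1], k=3))  # [2, 3, 4, 5]
print(speculative_step(target_model, draft_model, [4], k=3))  # [5, 6]
print(speculative_generate(target_model, draft_model, [1], n_new=10))
```

The output is always identical to what the target model would have produced on its own under greedy decoding. The draft model only changes how many target passes are needed, never which tokens come out.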

Context length tradeoffs. Pretraining context length sets a soft ceiling. You can extend at inference (RoPE scaling, YaRN, position interpolation) but quality degrades past ~4× the training length without continued pretraining. Modern frontier models train with explicit long-context phases: Llama 3.1 was pretrained at 8K then continued in stages up to 128K. Claude and Gemini reach 200K-1M, presumably with similar staged training, though their recipes are not public.

Hallucination is structural, not a bug to fix. A causal LM trained on next-token prediction has no explicit “I don’t know” signal in its objective: the loss rewards a plausible continuation whether or not the model has the underlying fact. Mitigations (RLHF refusal training, RAG grounding, calibrated uncertainty, post-hoc verification) reduce but don’t eliminate confabulation. This is the most important shape constraint to internalize about the architecture.

Real-world deployments

The GPT lineage is the substrate of essentially every general-purpose generative AI product in 2026:

OpenAI GPT-4 / o-series. Chat (ChatGPT, GPT-4 Turbo, GPT-4o), coding (Copilot under the hood), embeddings, fine-tuned variants for enterprise. Decoder-only transformer, reportedly mixture-of-experts; OpenAI has not disclosed the architecture.
Anthropic Claude (Opus, Sonnet, Haiku). Decoder-only transformer with extensive constitutional and RLHF post-training. 200K-1M context windows.
Google Gemini (Ultra, Pro, Flash, Nano). Decoder-only transformer family integrated with Google’s product surface (Search, Workspace, Android, Vertex AI). Native multimodality from pretraining.
Meta Llama 3.x. Open-weights decoder-only transformer; 1B-405B parameter range. The de facto base for most fine-tuning and academic research.
Mistral / Mixtral. European open-weights frontier. Mixtral-family is mixture-of-experts.
DeepSeek-V3 / R1. Chinese frontier, mixture-of-experts (671B total, 37B active), open weights, competitive with GPT-4 on many benchmarks at substantially lower training cost.
Qwen, Yi, Falcon, MPT. Other open-weights lineages, all decoder-only causal LMs.
GitHub Copilot, Cursor, Codeium. Decoder-only LLM-powered code completion. The underlying models are usually fine-tuned descendants of one of the above.
Customer service automation, content generation, data extraction at every Fortune 500 — almost always running fine-tuned or prompt-engineered decoder-only transformers, either via API or self-hosted open weights.

The decoder-only causal-LM architecture has the highest concentration of capability per dollar of compute that the field has ever produced. Whether something better is coming (state-space models, hybrid architectures, world models) is a real open question, but the GPT shape has dominated since 2020 and shows no signs of being displaced soon.

Why next-token prediction is doing more work than it looks like

Skeptics in 2018 dismissed causal LM as “just autocomplete”. The argument was that predicting the next word in a sentence couldn’t possibly require understanding — surely it was statistical pattern matching that would plateau. What actually happened is that to predict the next token reliably across the breadth of human-written text, the model has to implicitly learn syntax, semantics, world knowledge, arithmetic, code execution, multilingual translation, theory of mind, and the structure of arguments. The training signal at each position is weak (one token of cross-entropy), but the breadth of contexts the model encounters means the same weights must satisfy a vast tangled web of constraints simultaneously. The “just autocomplete” framing was correct about the objective and wrong about the implications. Whether scale-and-predict-next-token gets us to AGI is still debated; that it produced GPT-4-class systems would have been a fringe prediction in 2018, and it’s the central fact of the decade.