Attention Is All You Need (Transformer) — Gen AI

Origin and intuition

In 2017 the dominant language architectures were recurrent: LSTMs and GRUs running one token at a time, carrying a hidden state forward. Recurrence had two structural problems: it was sequential (you couldn’t parallelize across positions during training), and it squeezed all context through a single fixed-size hidden vector, so long-range dependencies degraded. Convolutional alternatives helped with parallelism but only saw a fixed local window per layer.

The transformer’s insight was to drop recurrence entirely and let every position read from every other position directly through a learned similarity function. The mechanism, scaled dot-product attention, costs O(n²) in sequence length but is fully parallel across positions within each layer (the layers themselves still run one after another), which on a GPU is a much better trade than O(n) sequential steps of recurrence. Long-range dependencies become a one-hop lookup instead of a many-step relay.

The original paper applied this to machine translation. Within three years it had displaced recurrence as the default for nearly every sequence task, and by 2020 the same architecture (with different training objectives) was running language understanding, language generation, image recognition (ViT), speech, and biology (AlphaFold’s Evoformer).

Inputs and outputs

A transformer takes a sequence of token IDs and produces, at each position, a contextual vector. What you do with that vector defines the flavour of transformer:

Encoder-only (BERT-family): one stack, bidirectional attention, used for classification or feature extraction. Output is per-token embeddings consumed by a downstream head.
Decoder-only (GPT-family): one stack, causal (left-to-right) attention, used for autoregressive generation. Output at position t is logits over the vocabulary for token t+1.
Encoder-decoder (the original 2017 setup, plus T5, FLAN-T5): two stacks. Encoder is bidirectional, decoder is causal and additionally cross-attends to encoder output. Used for sequence-to-sequence problems with a clear distinction between source and target.

The token vocabulary is typically a sub-word tokenization (BPE, WordPiece, SentencePiece) — large enough to represent words efficiently, small enough to bound the embedding table.

Architecture diagram

The block, repeated L times, is the unit that matters. Inside one transformer block:

input (n × d)
   │
   ├─► LayerNorm ─► Multi-Head Attention ─┐
   │                                       │
   └────────────────── residual add ◄──────┘
   │
   ├─► LayerNorm ─► Feed-Forward (d → 4d → d) ─┐
   │                                            │
   └────────────────── residual add ◄───────────┘
   │
   ▼
output (n × d)

Two things are doing all the work:

Multi-head attention mixes information across positions. Each head learns its own projection of queries/keys/values, attends independently, and the heads are concatenated and re-projected. Different heads end up specializing — some on local syntax, some on long-range coreference, some on positional patterns.
Feed-forward mixes information across features at each position independently. The 4×-wide intermediate dimension gives this sublayer most of the model’s parameters; in many modern variants this is where MoE routing lives.

The residual connections and pre-norm placement (LayerNorm before the sublayer, not after as in the 2017 paper) are what make a 100-layer transformer trainable. Pre-norm became standard after ~2020 because post-norm requires much more careful warmup to converge at depth.
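The two residual layouts can be sketched in plain Python with no dependencies. This is an illustrative sketch, not any library's API: the sublayers are passed in as functions, `layer_norm` omits the learned gain and bias, and all function names are assumptions made for the example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean, unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [u + v for u, v in zip(a, b)]

def pre_norm_block(x, attention, feed_forward):
    """x is a list of n vectors. `attention` maps a sequence to a sequence;
    `feed_forward` maps a single vector to a single vector."""
    # Sublayer 1: normalize, attend, then add back to the untouched input.
    normed = [layer_norm(v) for v in x]
    x = [add(v, a) for v, a in zip(x, attention(normed))]
    # Sublayer 2: normalize, apply the per-position MLP, add back.
    normed = [layer_norm(v) for v in x]
    return [add(v, feed_forward(h)) for v, h in zip(x, normed)]

def post_norm_block(x, attention, feed_forward):
    """The 2017 layout: the norm is applied after each residual add."""
    x = [layer_norm(add(v, a)) for v, a in zip(x, attention(x))]
    return [layer_norm(add(v, feed_forward(v))) for v in x]

# Stand-in sublayers that output zeros, to isolate the residual path.
zero_attn = lambda seq: [[0.0] * len(v) for v in seq]
zero_ffn = lambda v: [0.0] * len(v)

x = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
print(pre_norm_block(x, zero_attn, zero_ffn) == x)   # True
print(post_norm_block(x, zero_attn, zero_ffn) == x)  # False
```

With sublayers that contribute nothing, the pre-norm block is an exact identity, while the post-norm block rescales the stream at every layer. That untouched residual path is one common explanation for why pre-norm is easier to train at depth.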

Scaled dot-product attention, in one line

For query, key, and value matrices Q, K, V (each of shape n × d_k; in general V can have its own width d_v):

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

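The √d_k scaling stops the softmax saturating when d_k is large: a dot product of two d_k-dimensional vectors with independent, zero-mean, unit-variance components has variance d_k, so dividing by √d_k keeps the scores at unit scale. The row-wise softmax produces an n × n weight matrix; multiplying by V mixes value vectors according to those weights. Causal masking sets the entries of Q Kᵀ above the diagonal to −∞ before the softmax, so decoder positions can’t see the future.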
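The formula and the causal mask can be sketched in dependency-free Python. This is a minimal sketch for a single head on lists of lists, written with explicit loops for clarity rather than speed; the function names are illustrative assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Q, K: n x d_k lists of lists; V: n x d_v. Returns an n x d_v list."""
    scale = 1.0 / math.sqrt(len(K[0]))
    out = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                # Future position: masked out before the softmax.
                scores.append(float("-inf"))
            else:
                scores.append(scale * sum(a * b for a, b in zip(q, k)))
        weights = softmax(scores)
        # Weighted mix of the value vectors, one output column at a time.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
causal_out = attention(Q, K, V, causal=True)
print(causal_out[0])  # [1.0, 0.0]: position 0 can only attend to itself
```

Each output row is a convex combination of the rows of V. Under the causal mask, row i only mixes rows 0 through i.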

Positional encoding: the one place sequence order lives

Attention is permutation-equivariant: shuffle the tokens and the output shuffles the same way. So you have to inject position information explicitly. The original paper used sinusoidal encodings added to embeddings. Modern models almost all use relative positional schemes instead, such as RoPE (rotary), ALiBi (linear bias), or T5-style relative buckets, because they tend to handle sequences longer than those seen in training more gracefully than sinusoidal absolute positions do. Even so, RoPE usually needs frequency rescaling or fine-tuning to stretch far beyond its training length.
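For reference, the sinusoidal scheme from the original paper can be sketched as below: even feature indices get a sine, odd indices a cosine, with wavelengths growing geometrically across the feature dimension. The function name and the `base` parameter name are assumptions for this example; 10000 is the constant used in the paper.

```python
import math

def sinusoidal_encoding(n_positions, d_model, base=10000.0):
    """Return an n_positions x d_model table of absolute position encodings."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(0, d_model, 2):
            # One frequency per (sin, cos) pair; i is the even feature index.
            angle = pos / (base ** (i / d_model))
            row.append(math.sin(angle))
            if i + 1 < d_model:
                row.append(math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(n_positions=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]: position 0 is sin(0), cos(0) in every pair
```

The table is added element-wise to the token embeddings before the first block, which is the only point where the model sees token order.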

Training objective

The transformer itself is just an architecture; the objective depends on the flavour:

Causal LM (decoder-only): at each position, predict the next token. Cross-entropy loss against the actual next token. This is GPT-family pretraining.
Masked LM (encoder-only): randomly mask a fraction of the tokens (15% in BERT), predict them from their bidirectional context. This is BERT-family pretraining.
Span corruption (encoder-decoder): mask spans of tokens, predict the missing spans as the decoder target. This is T5/UL2-family pretraining.
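As a sketch of the causal LM objective listed above: the logits at position t are scored against the token at position t+1, so the targets are the input shifted left by one. The code is plain Python with illustrative names; real training code would do this on batched tensors.

```python
import math

def log_softmax(logits):
    """Stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_total for v in logits]

def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy.

    logits[t] is the model's score vector after reading token_ids[:t+1];
    it is compared against token_ids[t+1]. The last row has no target.
    """
    targets = token_ids[1:]
    if not targets:
        raise ValueError("need at least two tokens to form one prediction")
    # zip stops at the shorter sequence, which drops the final logits row.
    losses = [-log_softmax(row)[target] for row, target in zip(logits, targets)]
    return sum(losses) / len(losses)

# A model that knows nothing (uniform logits) scores log(vocab_size).
uniform = [[0.0] * 4 for _ in range(3)]
print(abs(causal_lm_loss(uniform, [2, 0, 3]) - math.log(4)) < 1e-12)  # True
```

Masked LM and span corruption use the same cross-entropy. They differ in which positions are scored and what the model is allowed to see.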

The dataset is overwhelmingly the dominant lever. The 2017 paper trained on millions of sentence pairs; a 2025 frontier pretrain is trillions of tokens of filtered, deduplicated web text plus code plus curated domain data. The architecture has barely changed; the data and optimizer recipes are what got better.

Where the block itself did change, the most visible difference is normalization:

Pre-norm + RMSNorm. LayerNorm is replaced by RMSNorm (drop the mean-centering, keep the scaling) and placed before the sublayer. Trains stably at much greater depth. Standard in modern open-weights models.

Original post-norm + LayerNorm. The 2017 setup. Works fine to ~24 layers with warmup; brittle past that. Mostly historical now.

Other refinements you’ll see in nearly every modern stack:

Grouped-query attention (GQA) / multi-query attention (MQA). Heads share key/value projections to shrink the KV-cache. Big inference-cost win with minor quality loss.
Rotary position embeddings (RoPE). Apply position as a rotation in the query/key space. Extrapolates better than sinusoidal absolute encodings.
SwiGLU activation in the feed-forward. Replaces ReLU or GELU. Slightly better loss for the same parameters.
FlashAttention. A kernel-level rewrite that fuses the attention computation to keep the n × n matrix off HBM. Same math, often 2-4× faster, dramatically less memory at long contexts. In practice a prerequisite for training and inference past ~8K context.
Mixture-of-experts (MoE) feed-forwards. Each token routes to k of N expert MLPs. Decouples parameter count from per-token FLOPs. Now common at the frontier.
Long-context tricks. Sliding-window attention, attention sinks, hybrid Mamba+attention layers, ring attention for sequence parallelism. The O(n²) cost is the recurring bottleneck.
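Of the refinements above, RoPE is the easiest to show in a few lines. The sketch below rotates adjacent feature pairs by an angle proportional to the position. Real implementations differ in how they pair dimensions and usually work on batched tensors, so treat the names and layout here as assumptions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by position * theta_i.

    Assumes len(vec) is even. theta_i shrinks geometrically with i, so
    early pairs rotate quickly and later pairs rotate slowly.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative offset (2 positions apart) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
print(abs(near - far) < 1e-9)  # True
```

Because the attention score between a rotated query and a rotated key depends only on the distance between their positions, RoPE behaves as a relative scheme even though it is applied per position.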

Practical considerations

Memory at training is often dominated by activations rather than parameters. For sequence length n, batch B, hidden size d, layers L, activation memory scales as O(B · n · d · L), plus an O(B · n²) term per head and per layer whenever the attention scores are materialized; the n is what hurts. Gradient checkpointing trades compute for memory; ZeRO/FSDP shard parameters, gradients, and optimizer state across GPUs.

Memory at inference is dominated by the KV cache. Each generated token needs the keys and values from every previous token in every layer. KV-cache size = 2 · B · n · L · d_kv elements, where d_kv is the key (or value) width per layer (number of KV heads × head dimension); multiply by the bytes per element to get memory. For a 70B-class model at 8K context, this is gigabytes per request. GQA, MQA, paged attention, and prefix-sharing across requests (vLLM-style) are all aimed at this.
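The KV-cache formula is easy to turn into a back-of-envelope calculator. The configuration below (80 layers, head dimension 128, 64 query heads, 8 KV heads under GQA, 2-byte elements) is an assumed, illustrative 70B-class shape, not a claim about any particular model.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_element=2):
    """KV-cache size in bytes: 2 (keys and values) * B * n * L * d_kv * element size."""
    d_kv = n_kv_heads * head_dim
    return 2 * batch * seq_len * n_layers * d_kv * bytes_per_element

GIB = 2 ** 30

# Assumed 70B-class shape at 8K context, one request, 16-bit cache.
with_gqa = kv_cache_bytes(batch=1, seq_len=8192, n_layers=80,
                          n_kv_heads=8, head_dim=128)
without_gqa = kv_cache_bytes(batch=1, seq_len=8192, n_layers=80,
                             n_kv_heads=64, head_dim=128)

print(with_gqa / GIB)     # 2.5
print(without_gqa / GIB)  # 20.0
```

Under these assumptions, sharing KV heads 8-to-1 cuts the per-request cache from about 20 GiB to 2.5 GiB, which is why GQA shows up in nearly every serving-oriented design.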

Throughput at serving comes from batching the prefill (processing the prompt, compute-bound) separately from decode (generating tokens one at a time, memory-bandwidth-bound). Continuous batching, where new requests join an in-flight decode batch token-by-token, is the standard pattern.

Real-world deployments

The architecture is now the substrate of essentially every deployed foundation model:

Language. GPT-family, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek — all decoder-only transformers with variations on the refinements above.
Vision. ViT (Vision Transformer) became a default backbone for image classification by ~2022, displacing ConvNets in most large-scale settings. CLIP, SAM, and the vision branches of most multimodal models use ViT-style encoders.
Speech. Whisper is an encoder-decoder transformer over log-mel spectrograms, and many modern TTS systems run transformer backbones over discrete audio codec tokens.
Biology and science. AlphaFold 2/3, ESM protein models, RFdiffusion, and several weather forecasting models are all built on transformers or closely related attention-based architectures.

The “transformer everywhere” trend is striking precisely because the architecture wasn’t designed for any of these modalities. Generality came from the abstraction: tokenize anything, give it positional information, attend.

Why the paper title aged this well

“Attention Is All You Need” read as bravado in 2017, when recurrence and convolution had both been argued essential. The authors meant it literally and architecturally: drop both, keep attention, ship. The title aged into a load-bearing claim about the field. Almost every frontier system since has been a refinement of, not a replacement for, the 2017 block. Mamba and other state-space alternatives chip at it; hybrid architectures borrow from both; but the transformer’s core has held for nine years, which by deep-learning standards is geological.