Architectures

The model architectures behind generative AI — RNN, LSTM, transformer, BERT, GPT, diffusion. Each writeup is a focused deep-dive on one design.

8 items: 3 Intermediate, 5 Advanced

An architecture page is not a paper recap — it's the answer to "what would I draw on a whiteboard if someone asked me to explain this model?" Each writeup follows the same eight-section template: origin and intuition, inputs and outputs, an architecture diagram, the training objective, variants and refinements, practical considerations, real-world deployments, related architectures.

We go in roughly chronological order: RNN → LSTM → seq2seq → attention → transformer → BERT / GPT → diffusion → vision transformers. Most steps address a limitation of the one before; attention, for example, removes the fixed-vector bottleneck of seq2seq.

Key concepts

Architecture and objective are independent axes: BERT and GPT share nearly the same transformer block and differ mainly in the attention mask (bidirectional vs. causal) and the training objective (masked-token vs. next-token prediction)
Attention is the key innovation: it replaced the sequential bottleneck of recurrence with parallel, content-based routing between positions
Diffusion is a generative paradigm, not a rival to the transformer: it uses iterative denoising for a different problem (continuous outputs such as images), and the denoising network can itself be a U-Net or a transformer
Encoder-only vs decoder-only vs encoder-decoder: pick by what you're generating, not by what's fashionable
Most modern frontier language models are transformer variants with tweaks: the question is which tweaks
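
To make the first two points concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes queries, keys, and values are all the same raw vectors (a real model would first apply learned projections), and the `causal` flag is an illustrative name for the mask that separates GPT-style from BERT-style attention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors.

    causal=False: every position attends to every position (BERT-style).
    causal=True:  position i attends only to positions <= i (GPT-style).
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # masked: softmax weight becomes 0
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors: a soft lookup.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three 2-dimensional token vectors (made up for illustration).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, w_bidirectional = attention(x, x, x)
_, w_causal = attention(x, x, x, causal=True)
```

The only difference between the two calls is the mask. With `causal=True` the first position can see only itself, so its weight row puts all its mass on position 0.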

Reference template

// Eight-section template for every architecture writeup
## Origin and intuition
## Inputs and outputs
## Architecture diagram
## Training objective
## Variants and refinements
## Practical considerations
## Real-world deployments
## Related architectures

Adapt to your problem; the structure is the load-bearing part.

Common pitfalls

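Memorizing diagrams without the intuition: the boxes are easy; the *why* of each box is the actual knowledge
Conflating model architecture with training recipe: the same architecture trained on different data or with a different objective is a different model
Treating attention as magic: it's a learned soft lookup, and its limits show up at long context, where the cost of standard self-attention grows quadratically with sequence length
Skipping over residual connections, layer norm, and positional encodings: they are load-bearing, since deep stacks train poorly without the first two and self-attention is order-blind without the third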
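
As a rough illustration of the last pitfall, the sketch below shows layer normalization and a residual connection for a single vector. It assumes the pre-norm arrangement (normalize, apply the sublayer, then add), which is one common choice; the original transformer normalized after the addition. The learned gain and bias of layer norm are left out, and `sublayer` stands in for attention or a feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Pre-norm residual: output = x + sublayer(layer_norm(x))."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

# A sublayer that contributes nothing leaves the input untouched,
# which is why stacked blocks start out close to the identity.
x = [2.0, 4.0, 6.0]
unchanged = residual_block(x, lambda h: [0.0] * len(h))
```

Because the input is added back unchanged, each block only has to learn a correction to `x`, and gradients have a direct path through the addition.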

Items (8)

Building Context with Neurons (RNNs)
Vanilla recurrent networks: sequential context, the gradient problem, why they fail past ~50 tokens.

Architecture Intermediate
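
A small numeric sketch of the gradient problem, assuming a scalar RNN `h_t = tanh(w * h_{t-1} + x_t)` with a hypothetical weight `w = 0.9` and an arbitrary bounded input. By the chain rule, the gradient of the final state with respect to the initial state is a product of one factor per step, and each factor here has magnitude at most 0.9.

```python
import math

def gradient_through_time(steps, w=0.9):
    """|d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = 0.0, 1.0
    for t in range(1, steps + 1):
        x_t = 0.5 * math.sin(t)          # arbitrary bounded input, for illustration only
        h = math.tanh(w * h + x_t)
        grad *= abs(w * (1.0 - h * h))   # chain rule: one Jacobian factor per step
    return grad

short_range = gradient_through_time(5)
long_range = gradient_through_time(50)   # far smaller than short_range
```

With a larger `|w|` the same product can grow instead of shrink, which is the exploding-gradient side of the same problem.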
Reconstructing Context with Sequence Models (LSTM / GRU)
Gated memory cells. How LSTMs and GRUs extended the useful context window from tens to hundreds of tokens.

Architecture Intermediate
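
The gating idea can be sketched with a scalar LSTM cell. This is a simplified illustration, not a production layer: every quantity is a single float, and the parameter names in `p` (`wf`, `uf`, `bf`, and so on) are hypothetical labels chosen for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps parameter names to floats."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate: how much old memory to keep
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate: how much new content to write
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate: how much memory to expose
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate content
    c = f * c_prev + i * g      # additive memory update
    h = o * math.tanh(c)
    return h, c

# Gates biased toward "keep everything, write nothing": memory persists.
params = {name: 0.0 for name in
          ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]}
params["bf"], params["bi"] = 20.0, -20.0
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(1.0, h, c, params)
```

The update to `c` is additive and scaled by the forget gate, so when that gate stays near 1 the cell state survives many steps almost unchanged.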
Encoder-Decoder Framework
Sequence-to-sequence: an encoder compresses the input into a fixed-length vector; a decoder generates the output token by token. Translation's first real shot.

Architecture Intermediate
Attention Is All You Need (Transformer)
The 2017 paper that rebuilt the field. Self-attention, positional encoding, parallel training, and why this killed RNNs for language.

Architecture Advanced
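
For the positional-encoding part, here is a sketch of the sinusoidal scheme described in the 2017 paper, computed for one position at a time. The function name and the list-of-floats representation are choices made for this example.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even dimensions, cos on odd ones."""
    pe = []
    for i in range(0, d_model, 2):
        # Wavelengths grow geometrically with the dimension index.
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        if i + 1 < d_model:
            pe.append(math.cos(angle))
    return pe

first = positional_encoding(0, 4)   # [0.0, 1.0, 0.0, 1.0]
later = positional_encoding(7, 4)
```

These vectors are added to the token embeddings so that self-attention, which otherwise ignores order, can tell positions apart.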
Bidirectional Transformers (BERT)
Masked language modeling. How BERT became the encoder of choice for classification, retrieval, and ranking.

Architecture Advanced
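
A simplified sketch of how masked language modeling builds a training example. It assumes every selected token is replaced by a `[MASK]` placeholder; BERT's published recipe also sometimes substitutes a random token or leaves the token unchanged, which is omitted here. The 15% rate matches the paper's default, and the function name is illustrative.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels are None wherever no loss is computed."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)     # position ignored by the loss
    return inputs, labels

inputs, labels = mask_tokens("the cat sat on the mat".split(), mask_prob=0.5)
```

The model sees context on both sides of each masked position, which is why this objective pairs naturally with bidirectional attention.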
Generative Pretraining (GPT)
Causal language modeling at scale. The architectural choice that turned a language model into a general-purpose tool.

Architecture Advanced
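
Causal language modeling reduces to predicting each token from the tokens before it. The sketch below spells out the (context, target) pairs for a short sequence; the function name is made up for this example, and real training computes all of these predictions in one pass using a causal attention mask instead of materializing each prefix.

```python
def causal_lm_examples(tokens):
    """List (context, target) pairs: predict each token from everything before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = causal_lm_examples(["the", "cat", "sat"])
# [(['the'], 'cat'), (['the', 'cat'], 'sat')]
```

Every position supplies a training signal, and generation at inference time is the same prediction applied repeatedly to the model's own output.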
Diffusion Models
Iterative denoising as a generative process. The approach behind Stable Diffusion, DALL·E 2, and Sora.

Architecture Advanced
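
The forward (noising) half of the process can be sketched in a few lines. This assumes a DDPM-style linear noise schedule with the commonly cited defaults (1000 steps, beta from 1e-4 to 0.02) and treats the data as a flat list of floats; the function names are illustrative.

```python
import math
import random

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction at each step under a linear beta schedule."""
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars

def add_noise(x0, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * n
           for v, n in zip(x0, noise)]
    return x_t, noise

schedule = alpha_bars()
rng = random.Random(0)
x0 = [0.5, -0.2, 0.9]                                  # toy "image" of three pixels
slightly_noisy, _ = add_noise(x0, schedule[0], rng)    # almost all signal
nearly_pure_noise, _ = add_noise(x0, schedule[-1], rng)  # almost no signal
```

Training teaches a network to predict the added `noise` from `x_t` and the step index; sampling then runs that prediction in reverse, starting from pure noise.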
Vision Models (CNN → ViT)
From convolutional layers to vision transformers. How images became sequences and joined the transformer party.

Architecture Advanced
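
"Images became sequences" comes down to cutting the image into patches and flattening each one into a vector. The sketch below assumes a single-channel image stored as a list of rows and a hypothetical `patchify` helper; a ViT would follow this with a learned linear projection and positional embeddings.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tiles, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

image = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4 x 4 grid holding 0..15
tokens = patchify(image, 2)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Each inner list plays the role a token embedding plays in a language model, so the rest of the transformer stack can be reused as is.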
