Architectures
The model architectures behind generative AI — RNN, LSTM, transformer, BERT, GPT, diffusion. Each writeup is a focused deep-dive on one design.
An architecture page is not a paper recap — it's the answer to "what would I draw on a whiteboard if someone asked me to explain this model?" Each writeup follows the same eight-section template: origin and intuition, inputs and outputs, an architecture diagram, the training objective, variants and refinements, practical considerations, real-world deployments, related architectures.
We go in roughly chronological order: RNN → LSTM → seq2seq → attention → transformer → BERT / GPT → diffusion → vision transformers. Each one solves a problem the previous one couldn't.
Key concepts
- Architecture and objective are independent axes: BERT and GPT share the same transformer block and differ mainly in the attention mask (bidirectional vs causal) and the training objective (masked-token vs next-token prediction)
- Attention is the key innovation — it replaced sequential bottlenecks with parallel routing
- Diffusion is not a transformer variant: it's a different generative paradigm (iterative denoising) for a different problem (continuous outputs), and the denoising network inside it can itself be a U-Net or a transformer
- Encoder-only vs decoder-only vs encoder-decoder — pick by what you're generating, not by what's fashionable
- Every modern frontier model is a transformer variant with tweaks — the question is which tweaks
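To make the "soft lookup" and "same architecture, different objective" points concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The function names (`attention_weights`, `attend`) and the toy vectors are illustrative assumptions, not taken from any library; real implementations use batched tensor operations and learned query/key/value projections. The only difference between the BERT-style and GPT-style call is the `causal` flag.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, causal=False):
    """Return the attention matrix softmax(Q K^T / sqrt(d)), one row per query.

    causal=False: every position attends to every position (BERT-style).
    causal=True:  position i attends only to positions <= i (GPT-style).
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            if causal and j > i:
                row.append(float("-inf"))  # masked out: a future position
            else:
                row.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights.append(softmax(row))
    return weights

def attend(queries, keys, values, causal=False):
    """Soft lookup: each output is a weighted average of the value vectors."""
    w = attention_weights(queries, keys, causal)
    width = len(values[0])
    return [
        [sum(w_ij * v[c] for w_ij, v in zip(row, values)) for c in range(width)]
        for row in w
    ]

# Toy 3-token sequence with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bidirectional = attention_weights(x, x, causal=False)  # full 3x3 of nonzero weights
causal = attention_weights(x, x, causal=True)          # zeros above the diagonal
```

In the causal matrix the first token can only look at itself, so its row puts all weight on position 0; in the bidirectional matrix every row spreads weight over all three positions. Everything else about the computation is shared.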
Reference template
// Eight-section template for every architecture writeup
## Origin and intuition
## Inputs and outputs
## Architecture diagram
## Training objective
## Variants and refinements
## Practical considerations
## Real-world deployments
## Related architectures
Adapt to your problem; the structure is the load-bearing part.
Common pitfalls
- Memorizing diagrams without the intuition — the boxes are easy; the *why* of each box is the actual knowledge
- Conflating model architecture with training recipe — same architecture, different data, different model
- Treating attention as magic — it's a learned soft lookup, and the limits show up at long context
- Skipping over residual connections, layer norm, and positional encodings — those are load-bearing
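The "load-bearing" pieces in the last pitfall are small enough to write out. The sketch below is an assumption-laden illustration in plain Python: `sinusoidal_position` follows the fixed sine/cosine scheme from the 2017 transformer paper, `layer_norm` omits the learned gain and bias, and `residual_block` uses the pre-norm arrangement (`x + sublayer(norm(x))`), which is one common choice rather than the only one. All names and numbers are hypothetical.

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed positional encoding for one position: sin on even dims, cos on odd dims."""
    vec = []
    for i in range(d_model):
        # Dims 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
        freq = 1.0 / (10000 ** ((i - i % 2) / d_model))
        vec.append(math.sin(pos * freq) if i % 2 == 0 else math.cos(pos * freq))
    return vec

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Pre-norm residual connection: the sublayer only adds an update to x."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

# The same (made-up) token embedding at three different positions.
embedding = [0.5, -0.2, 0.1, 0.9]
tokens_with_position = [
    [e + p for e, p in zip(embedding, sinusoidal_position(pos, 4))]
    for pos in range(3)
]

# A sublayer that outputs zeros leaves the input untouched: the residual
# path is an identity shortcut that signals and gradients can always take.
identity_out = residual_block(tokens_with_position[1], lambda h: [0.0] * len(h))
```

Without the positional term, the three copies of the token would be indistinguishable to attention. Without the residual path, a sublayer that has learned nothing useful would erase its input instead of passing it through.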
Related topics
- Building Context with Neurons (RNNs)
Vanilla recurrent networks: sequential context, the gradient problem, why they fail past ~50 tokens.
Architecture Intermediate
- Reconstructing Context with Sequence Models (LSTM / GRU)
Gated memory cells. How LSTMs and GRUs extended the useful context window from tens to hundreds of tokens.
Architecture Intermediate
- Encoder-Decoder Framework
Sequence-to-sequence: an encoder compresses input to a fixed vector; a decoder generates output token-by-token. Translation's first real shot.
Architecture Intermediate
- Attention Is All You Need (Transformer)
The 2017 paper that rebuilt the field. Self-attention, positional encoding, parallel training, and why this killed RNNs for language.
Architecture Advanced
- Bidirectional Transformers (BERT)
Masked language modeling. How BERT became the encoder of choice for classification, retrieval, and ranking.
Architecture Advanced
- Generative Pretraining (GPT)
Causal language modeling at scale. The architectural choice that turned a language model into a general-purpose tool.
Architecture Advanced
- Diffusion Models
Iterative denoising as a generative process. The architecture under Stable Diffusion, DALL·E 2, and Sora.
Architecture Advanced
- Vision Models (CNN → ViT)
From convolutional layers to vision transformers. How images became sequences and joined the transformer party.
Architecture Advanced
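For the "iterative denoising" idea in the Diffusion Models entry, a small sketch of the noising side may help. It assumes a DDPM-style formulation with a linear noise schedule; the function names, the schedule endpoints, and the toy data are illustrative assumptions. It shows only the closed-form noising step and its algebraic inverse, not a trained denoiser or a full sampling loop.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors alpha_bar_t for a linear beta schedule."""
    alpha_bars = []
    prod = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(x0, alpha_bar, noise):
    """Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]

def predict_x0(xt, alpha_bar, predicted_noise):
    """Invert the forward step given a noise estimate (a trained network's job)."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [(x - b * n) / a for x, n in zip(xt, predicted_noise)]

random.seed(0)
x0 = [0.8, -0.3, 0.5]                 # a made-up continuous "data point"
alpha_bars = linear_alpha_bars(1000)
noise = [random.gauss(0.0, 1.0) for _ in x0]

xt = add_noise(x0, alpha_bars[500], noise)
# Here the true noise stands in for the model's prediction, so recovery is exact
# up to floating-point error. A real model only approximates the noise, which is
# why sampling proceeds in many small denoising steps.
recovered = predict_x0(xt, alpha_bars[500], noise)
```

The training objective in this setup is to predict `noise` from `xt` and the step index; the network that makes that prediction is where a U-Net or a transformer plugs in.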