
Architectures

The model architectures behind generative AI — RNN, LSTM, transformer, BERT, GPT, diffusion. Each writeup is a focused deep-dive on one design.

8 items: 3 Intermediate, 5 Advanced

An architecture page is not a paper recap. It is the answer to "what would I draw on a whiteboard if someone asked me to explain this model?" Each writeup follows the same eight-section template: origin and intuition, inputs and outputs, an architecture diagram, the training objective, variants and refinements, practical considerations, real-world deployments, and related architectures.

We go in roughly chronological order: RNN → LSTM → seq2seq → attention → transformer → BERT / GPT → diffusion → vision transformers. Along the language line, each step solves a problem the previous one couldn't; diffusion and vision transformers carry the story beyond text.

Key concepts

  • Architecture and objective are independent axes — BERT and GPT use almost the same architecture with different training objectives
  • Attention is the key innovation — it replaced sequential bottlenecks with parallel routing
  • Diffusion is a generative process, not a rival architecture: iterative denoising for continuous outputs (images, audio, video), run on a U-Net or a transformer backbone
  • Encoder-only vs decoder-only vs encoder-decoder — pick by what you're generating, not by what's fashionable
  • Nearly every modern frontier language model is a transformer variant with tweaks: the question is which tweaks
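The first point above, that architecture and objective are independent axes, can be made concrete with a small sketch. It builds training examples from one token sequence under two objectives: causal (GPT-style) and masked (BERT-style). The function names, the `[MASK]` placeholder, and the masking rate are illustrative assumptions, and the masking is simplified (the original BERT recipe also sometimes swaps in random or unchanged tokens).

```python
import random

MASK = "[MASK]"  # placeholder token name, assumed for illustration


def causal_lm_pairs(tokens):
    """GPT-style objective: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """BERT-style objective (simplified): hide some tokens and
    predict them using context from both sides."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # position -> original token to recover
        else:
            corrupted.append(tok)
    return corrupted, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = causal_lm_pairs(tokens)
corrupted, targets = masked_lm_example(tokens, mask_prob=0.5)

print(pairs[0])            # first (context, next-token) pair
print(corrupted, targets)  # masked input and the tokens to recover
```

The same stack of transformer layers could consume either kind of example; what changes is which positions are hidden from the model and which it is asked to predict.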

Reference template

<!-- Eight-section template for every architecture writeup -->
## Origin and intuition
## Inputs and outputs
## Architecture diagram
## Training objective
## Variants and refinements
## Practical considerations
## Real-world deployments
## Related architectures

Adapt to your problem; the structure is the load-bearing part.

Common pitfalls

  • Memorizing diagrams without the intuition — the boxes are easy; the *why* of each box is the actual knowledge
  • Conflating model architecture with training recipe — same architecture, different data, different model
  • Treating attention as magic — it's a learned soft lookup, and the limits show up at long context
  • Skipping over residual connections, layer norm, and positional encodings — those are load-bearing
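To take some of the magic out of attention, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The helper names and the toy vectors are assumptions chosen for illustration; real implementations are batched, multi-headed, and use learned projections to produce the queries, keys, and values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Soft lookup: score the query against every key, then return
    the weighted average of the values."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    out = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return out, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
out, weights = attend([4.0, 0.0], keys, values)

print(weights)  # keys aligned with the query get most of the weight
print(out)
```

Every query is scored against every key, so the work grows with the square of the sequence length, and the softmax has to spread a fixed budget of weight over more and more positions. That is one way to read "the limits show up at long context".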

Items (8)

  • Building Context with Neurons (RNNs)

    Vanilla recurrent networks: sequential context, the gradient problem, why they fail past ~50 tokens.

    Architecture Intermediate
  • Reconstructing Context with Sequence Models (LSTM / GRU)

    Gated memory cells. How LSTMs and GRUs extended the useful context window from tens to hundreds of tokens.

    Architecture Intermediate
  • Encoder-Decoder Framework

    Sequence-to-sequence: an encoder compresses input to a fixed vector; a decoder generates output token-by-token. Translation's first real shot.

    Architecture Intermediate
  • Attention Is All You Need (Transformer)

    The 2017 paper that rebuilt the field. Self-attention, positional encoding, parallel training, and why this killed RNNs for language.

    Architecture Advanced
  • Bidirectional Transformers (BERT)

    Masked language modeling. How BERT became the encoder of choice for classification, retrieval, and ranking.

    Architecture Advanced
  • Generative Pretraining (GPT)

    Causal language modeling at scale. The architectural choice that turned a language model into a general-purpose tool.

    Architecture Advanced
  • Diffusion Models

    Iterative denoising as a generative process. The method behind Stable Diffusion, DALL·E 2, and Sora.

    Architecture Advanced
  • Vision Models (CNN → ViT)

    From convolutional layers to vision transformers. How images became sequences and joined the transformer party.

    Architecture Advanced
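
"How images became sequences" in the last item comes down to cutting the image into fixed-size patches and flattening each one into a vector. A minimal sketch, assuming a single-channel image stored as a nested list and a hypothetical helper named `patchify`:

```python
def patchify(image, patch):
    """Split an H x W grid into a row-major sequence of flattened
    patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            sequence.append(tile)
    return sequence


# A 4 x 4 toy "image" whose pixel values are just their raster index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)

print(len(tokens))  # number of patches in the sequence
print(tokens[0])    # pixels of the top-left patch
```

In a vision transformer, each flattened patch is then passed through a learned linear projection and given a position embedding, after which the sequence is processed like a sequence of word tokens.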