The Emergence of Generative AI

What changed in 2017 (attention), 2018 (GPT-1/BERT), 2020 (GPT-3 scale), and 2022 (ChatGPT, productization).

Concept Foundational
9 min read
foundations history generative-models

Summary#

Generative AI did not “emerge” — it crossed four specific thresholds in five years, and each threshold turned what had been a research curiosity into a deployable product. The 2017 transformer paper made sequence modeling parallelizable. The 2018 GPT-1 and BERT papers established the two dominant pretraining paradigms. The 2020 GPT-3 paper showed that scaling alone produced qualitatively new capabilities (in-context learning). The 2022 ChatGPT launch wrapped those capabilities in a chat interface non-experts could use, and the product economy reorganized around the result.

Each threshold required the previous one. None were inevitable in their timing. The chain is short enough that an engineer joining the field in 2026 can read the four original papers and the launch postmortem and have a coherent picture of where we are and why.

Why it matters#

Most engineering decisions in modern AI systems are downstream of choices made in those five years. Why context windows exist. Why fine-tuning works at all. Why pretraining matters more than architecture. Why scaling laws are taken seriously. Why instruction-tuning is a separate step from pretraining. Why RLHF showed up.

The historical detail also tells you something about the next five years. Each of the four thresholds had a binding constraint that broke just before the threshold was crossed: sequential training (broken by attention), task-specific architectures (broken by transfer from pretrained models), limited data and compute (broken by industrial investment), and an unfamiliar interface (broken by chat). The current binding constraints are clearer when you can name the previous ones.

For interviews, this is a question that filters surface knowledge from durable understanding. Anyone can name the four releases. Knowing what each one broke, and what would not have happened without the previous one, is the senior-level answer.

How it works#

2017: Attention removed the sequential bottleneck#

By 2016, the dominant architecture for sequence modeling was an encoder-decoder with LSTMs and attention. Attention was already in the picture — Bahdanau (2014), Luong (2015) — but as a mechanism on top of recurrence. The encoder ran an RNN over the input, the decoder ran an RNN over the output, and attention let the decoder look at all encoder states.

The 2017 “Attention Is All You Need” paper proposed removing the recurrence entirely. Self-attention let each position in the sequence directly access every other position. The architecture became a stack of attention plus feed-forward blocks, with positional encoding to inject order information.

Two things changed:

  • Training parallelism. RNN training is sequential within a batch element: token t needs the hidden state from t-1. Transformers can process all positions in parallel during training, which maps well to GPU hardware. Wall-clock training time for comparable models dropped substantially on hardware built for large matrix multiplications.
  • Long-range dependencies. RNN gradients struggle to flow more than a few hundred timesteps; attention has a constant-length path between any two positions. Long-range modeling improved immediately.
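The contrast between the two bullets above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: it assumes a single attention head with identity projections (no learned query, key, or value matrices, no positional encoding), and `recurrent_states` with its `decay` parameter is a hypothetical stand-in for an RNN cell.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections.

    x is a list of token vectors, each of length d. Illustrative assumption:
    queries, keys, and values are all the raw token vectors.
    """
    d = len(x[0])
    out = []
    for q in x:  # each iteration is independent, so all positions could run in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)  # one weight per position in the sequence
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

def recurrent_states(x, decay=0.5):
    """Toy recurrence: the state at step t cannot be computed before step t-1."""
    h = [0.0] * len(x[0])
    states = []
    for tok in x:
        h = [decay * hi + ti for hi, ti in zip(h, tok)]
        states.append(h)
    return states

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)   # every output row mixes all three tokens directly
states = recurrent_states(tokens)   # the last state reaches the first token only via the chain
```

In `self_attention` the loop body never reads a previous iteration's result, which is what makes the computation parallelizable; in `recurrent_states` each step consumes the previous `h`, which is the sequential bottleneck the section describes.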

The paper’s headline result was translation, and the model was modest by today’s standards (about 65M parameters in the base configuration, 213M in the large one). But the architecture was the seed. Almost every frontier model since has been a variation of it.

2018: GPT-1 and BERT established the two pretraining paradigms#

Within a year of the transformer paper, two teams independently demonstrated that you could pretrain a transformer on a large unlabelled corpus and then fine-tune it for downstream tasks with much less labelled data than training from scratch required.

GPT-1 (June 2018) used a decoder-only transformer trained on next-token prediction over BookCorpus. Causal masking ensured the model could only see previous tokens. The result was a model that could be fine-tuned for classification, entailment, similarity, and other tasks with surprisingly little task-specific data. The headline insight: generative pretraining transfers to discriminative tasks.

BERT (October 2018) used an encoder-only transformer trained on masked language modeling and next-sentence prediction over BookCorpus and Wikipedia. Bidirectional attention let each token see context in both directions. BERT dominated classification, retrieval, and ranking benchmarks for the next several years.

The split between the two paradigms — causal (decoder-only, generative) and masked (encoder-only, bidirectional) — became the defining axis of NLP architecture for the next five years. Encoder-decoder variants (T5, BART) sat between them.
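The causal-versus-masked distinction comes down to which positions a token may attend to and what the training target is. The sketch below is a minimal illustration under stated assumptions: all function names are hypothetical, tokens are whole words rather than subword units, and the masked positions are picked by hand where a real pipeline would sample them at random.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def visible(mask, tokens, i):
    """Tokens that position i is allowed to see under the given mask."""
    return [tok for tok, ok in zip(tokens, mask[i]) if ok]

def next_token_pairs(tokens):
    """Causal objective: predict token i from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Masked objective: hide the chosen positions and predict them from both sides."""
    corrupted = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets

tokens = ["the", "cat", "sat", "down"]
causal = causal_mask(len(tokens))
full = bidirectional_mask(len(tokens))

print(visible(causal, tokens, 1))   # decoder-only view: left context only
print(visible(full, tokens, 1))     # encoder-only view: the whole sequence
print(next_token_pairs(tokens))
print(mask_tokens(tokens, {2}))
```

The causal setup yields a model that can generate text one token at a time, because training never shows it the future. The masked setup sees both sides of every gap, which suits classification and retrieval but gives no natural left-to-right generation procedure.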

2020: GPT-3 showed that scale produces new capabilities#

By 2020, the recipe was clear: take a decoder-only transformer, pretrain it on a lot of text, fine-tune for tasks. GPT-2 (1.5B parameters, 2019) had shown that larger models produced better text and learned more from the same data.

GPT-3 (175B parameters, paper posted May 2020) was roughly two orders of magnitude larger. The paper’s main contribution was not architectural: the architecture was essentially the same as GPT-2. The contribution was demonstrating that at that scale, the model exhibited in-context learning: you could specify a task in the prompt with a few examples, and the model would perform it without any weight updates.

This was a phase change. Pretrain-and-fine-tune required collecting labelled data and running a training job for every new task. Pretrain-and-prompt required collecting a prompt template. The deployment surface widened by orders of magnitude almost overnight.
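To make "collecting a prompt template" concrete, here is a minimal sketch of a few-shot prompt builder. The `Input:`/`Output:` layout, the function name, and the sentiment examples are illustrative assumptions, not a format prescribed by the GPT-3 paper; the resulting string would be sent to whatever completion model is in use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: an instruction, worked examples, then the open query.

    The task is specified entirely in the text; no model weights are updated.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

# Hypothetical sentiment task defined purely by examples.
examples = [
    ("The battery died within an hour.", "negative"),
    ("Setup took two minutes and it just worked.", "positive"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    examples,
    "The screen is gorgeous but the speakers are tinny.",
)
print(prompt)
```

Switching to a new task means changing `instruction` and `examples`, with no labelled training set and no training job, which is the widening of the deployment surface described above.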

The Kaplan scaling laws paper (also 2020) made the cost prediction explicit: loss falls as a smooth power law in each of compute, parameter count, and dataset size, as long as the other two are not the bottleneck. Once organisations trusted the curve, training-budget decisions became financial planning rather than research betting.
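A power law is a straight line in log-log space, which is why a handful of small training runs can be used to project the loss of a much larger one. The sketch below fits L = a * x^(-b) by least squares on logarithms and extrapolates. The data points and the constants 5.0 and 0.05 are synthetic placeholders chosen for illustration, not the values reported by Kaplan et al.

```python
import math

def fit_power_law(xs, losses):
    """Fit L = a * x**(-b) by ordinary least squares on (log x, log L)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    slope = (
        sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
        / sum((u - mean_x) ** 2 for u in lx)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predict_loss(a, b, x):
    """Loss projected by the fitted power law at scale x."""
    return a * x ** (-b)

# Synthetic "small runs": compute budget in arbitrary units and the resulting loss.
compute = [1e3, 1e4, 1e5, 1e6]
losses = [5.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, losses)
projected = predict_loss(a, b, 1e8)  # extrapolate two orders of magnitude past the data
print(f"a={a:.3f}, b={b:.3f}, projected loss at 1e8: {projected:.3f}")
```

This is the sense in which budget decisions become planning: given a fitted curve, the expected loss at a larger budget is a calculation, subject to the assumption that the power law keeps holding beyond the measured range.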

2022: ChatGPT productized the capabilities#

InstructGPT (early 2022) demonstrated that fine-tuning a base model on human demonstrations of instruction-following, then using reinforcement learning from human feedback (RLHF) to align the model’s output with human preferences, produced a model that was much better at following user instructions than the base GPT-3 was.
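The "human preferences" step is usually implemented by training a reward model on pairs of responses that a labeller has ranked. A common formulation, assumed here rather than taken from the text above, is a pairwise logistic loss: the reward model is penalized when it scores the rejected response above the chosen one. The function names and reward values are hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite to avoid overflow in exp(-diff).
    return -diff + math.log1p(math.exp(diff))

def batch_preference_loss(pairs):
    """Mean loss over (reward_chosen, reward_rejected) pairs."""
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)

# Hypothetical reward-model scores for three labelled comparisons.
pairs = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
print(batch_preference_loss(pairs))
```

The loss shrinks as the chosen response's score pulls ahead of the rejected one, and equals log 2 when the two scores tie. The trained reward model then supplies the reward signal for the reinforcement learning stage.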

ChatGPT (November 2022) wrapped an InstructGPT-class model in a chat interface and made it free to the public. The launch produced one million users in five days. The product surface — a chat box, a thread of messages, no API key — was the missing piece. The capability had been there for two years; the interface had not.

What this changed:

  • The non-technical audience discovered LLMs. Within months, prompt engineering became a profession, and “AI” as a product category rebuilt itself around generative models.
  • The competitive landscape reset. Every major tech company shipped a chat product within a year. Open-weights chat models followed within two years.
  • The funding environment for AI startups expanded by an order of magnitude. Compute access stratified into “frontier labs” and “everyone else.”

The thresholds before ChatGPT had been technical; the ChatGPT threshold was market. From this point forward, the field’s pace was driven by deployment economics as much as by research.

Variants and trade-offs#

The four-threshold story is one narrative. There are other reasonable framings.

The “architecture matters” framing emphasises the 2017 transformer paper as the singular event. Everything since is variation on that architecture. Under this view, the next breakthrough requires a new architecture, and the field’s current focus on scale is a distraction.

The “scale matters” framing emphasises the 2020 GPT-3 paper and the scaling laws as the singular event. The transformer was necessary but not sufficient; what made the field’s current state possible was the willingness to spend hundreds of millions of dollars on a single training run. Under this view, the next breakthrough requires another order of magnitude of compute.

Both are partially right. The honest answer is that the architecture was necessary (no parallel training, no scale), the scale was necessary (no in-context learning at small sizes), and neither alone would have produced the current state. The interview signal is whether you understand the interaction.

Other historical patterns worth holding:

  • Not every parallel breakthrough was transformer-based. Diffusion models (2020 DDPM, 2022 Stable Diffusion) revitalized image generation outside the transformer paradigm. CLIP (2021) bridged vision and language with a contrastive objective. The “transformer is everything” framing is roughly correct for text and increasingly correct for vision, but the broader generative AI story has more than one architectural lineage.
  • The open-weights wave is a separate trajectory. Stable Diffusion (August 2022) was an open-weights model of roughly one billion parameters (an 860M-parameter UNet plus a 123M-parameter text encoder) that ran on consumer hardware. Llama (February 2023) and its successors did the same for language. The open-weights track has consistently lagged the closed frontier by 12-18 months but has democratized access at every step.
  • Reasoning training (2024 onwards) is arguably a fifth threshold. OpenAI’s o1 and successors trained on long internal reasoning traces, opening a “test-time compute” axis that scaling-law analyses did not anticipate. Whether this is a new threshold or a continuation of the scaling story is still being argued.

The papers worth reading in order

If you read four papers and one launch postmortem in chronological order, you will have a coherent picture of how the field got here:

  1. “Attention Is All You Need” (Vaswani et al., 2017) — the architecture.
  2. “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018) — the GPT-1 paper, generative pretraining.
  3. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2018) — the masked-pretraining paradigm.
  4. “Language Models are Few-Shot Learners” (Brown et al., 2020) — GPT-3, in-context learning, scale.
  5. The InstructGPT paper (Ouyang et al., 2022) plus the public OpenAI blog post on ChatGPT’s launch — productization, RLHF, the move from research to product.

Plus the Kaplan scaling laws paper (2020) and the Chinchilla paper (2022) as essential context. Total reading time: a focused weekend.

When this is asked in interviews#

This is a high-frequency historical-context question on senior AI engineering loops. The interviewer is usually checking whether you can put modern systems in their historical setting, not testing rote knowledge.

The canonical question: “Walk me through how we got from 2017 to ChatGPT.” Three minutes is plenty. A good answer hits the four thresholds, names the bottleneck broken at each, and links to a current engineering decision driven by that history.

Common follow-ups:

  • “What would have been different if BERT had won instead of GPT?” — tests whether you understand the causal-vs-masked distinction and why generation favoured decoder-only.
  • “Why did ChatGPT take off when the underlying model had been available for two years?” — interface, accessibility, and instruction-tuning are the three pieces.
  • “What is the next threshold?” — there is no correct answer; the signal is whether you can reason about current bottlenecks (reasoning depth, agentic reliability, evaluation, training data, energy).

Senior engineering loops sometimes use this question as a proxy for “do you actually read papers?” If you can name authors, dates, and what each paper specifically contributed, you separate yourself from candidates whose understanding came from secondary summaries.
