What Is Generative AI? — Gen AI · Engineering Playbook

Summary

Generative AI is the family of machine-learning systems that produce new artefacts — text, code, images, audio, video, structured data — drawn from a learned distribution over examples. The technical line between this and the rest of ML is not the model class but the objective: a generative model fits p(x) (or p(x | y)) so that sampling from the fit produces plausible new x. A discriminative model fits p(y | x) and only labels existing x.
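The distinction can be sketched on toy data. The example below is a minimal illustration, assuming one-dimensional features and a Gaussian fit per class; the dataset, the function names, and the midpoint-threshold classifier are all made up for this sketch. The generative side estimates p(x | y) and can produce a new x, while the discriminative side keeps only a decision boundary and can only label an x it is given.

```python
import random
import statistics

# Toy 1-D dataset of (feature x, label y). Values are invented for illustration.
data = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (3.9, "b"), (4.1, "b"), (4.0, "b")]

def fit_generative(rows):
    """Fit p(x | y) as one Gaussian per label: returns {label: (mean, stdev)}."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {y: (statistics.mean(xs), statistics.stdev(xs)) for y, xs in by_label.items()}

def sample(model, label, rng):
    """Draw a new x from the fitted p(x | y = label)."""
    mean, stdev = model[label]
    return rng.gauss(mean, stdev)

def fit_discriminative(rows):
    """Keep only a decision boundary: the midpoint between the two class means."""
    mean_a = statistics.mean(x for x, y in rows if y == "a")
    mean_b = statistics.mean(x for x, y in rows if y == "b")
    return (mean_a + mean_b) / 2

def classify(threshold, x):
    """Label an existing x; there is nothing here to sample from."""
    return "a" if x < threshold else "b"

rng = random.Random(0)
gen_model = fit_generative(data)
threshold = fit_discriminative(data)

new_x = sample(gen_model, "b", rng)   # a point that was never in the dataset
label = classify(threshold, 1.1)      # a label for a point we already have
print(new_x, label)
```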

What changed between roughly 2017 and 2022 was the combination of three forces (the transformer architecture, internet-scale pretraining, and a sharp drop in the cost of compute), which together moved generation from a research curiosity to a deployable engineering primitive.

Why it matters

Most ML-backed software written before this era assumed a fixed prediction target chosen in advance: spam-or-not, click-or-not, fraud-or-not. Generative AI lets you call a primitive that writes the response itself. That is a different shape of integration: stateless inputs in, free-form artefacts out, with the system’s correctness defined by an evaluation set you build, not by precision-recall on a fixed label space.
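What "correctness defined by an evaluation set" can look like is sketched below. This is a minimal illustration, not a recommended harness: `generate` is a hypothetical stand-in for a model call with canned outputs, and the prompts and checks are invented. The point is only that each case pairs a prompt with a predicate over free-form output, and the metric is a pass rate rather than precision-recall.

```python
# Hypothetical stand-in for a model call; a real system would call an API
# or a local model here. The canned outputs are invented for illustration.
def generate(prompt):
    canned = {
        "Summarize: the cat sat on the mat.": "A cat sat on a mat.",
        "Translate to French: hello": "bonjour",
    }
    return canned.get(prompt, "")

# Each eval case pairs a prompt with a check on the free-form output.
EVAL_SET = [
    ("Summarize: the cat sat on the mat.",
     lambda out: "cat" in out.lower() and len(out) < 40),
    ("Translate to French: hello",
     lambda out: out.strip().lower() == "bonjour"),
    ("Translate to French: goodbye",
     lambda out: "au revoir" in out.lower()),
]

def pass_rate(model, cases):
    """Fraction of eval cases whose check accepts the model's output."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

score = pass_rate(generate, EVAL_SET)
print(f"pass rate: {score:.2f}")
```

Swapping in a different model, prompt template, or decoding setting and re-running the same cases is what turns the evaluation set into the system's regression test.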

The practical implication is that the cost model changes too. A small discriminative classifier can serve a prediction in microseconds to a few milliseconds; a foundation-model call takes tens of milliseconds at best and seconds at worst, and token-billed pricing means cost grows almost linearly with the amount of text each user interaction consumes and produces. Designing around generative AI means designing around this latency and unit-economics shape, not just around accuracy.
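A back-of-the-envelope version of that cost model is sketched below. Every number in it is a placeholder assumption (the prices, the time to first token, and the decode speed are not any provider's real figures), and the linear latency model ignores queueing and prompt-processing time.

```python
from dataclasses import dataclass

@dataclass
class CallProfile:
    input_tokens: int
    output_tokens: int

# Placeholder assumptions, not real pricing or measured speeds.
PRICE_PER_1K_INPUT = 0.001    # currency units per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.003   # currency units per 1,000 output tokens
TIME_TO_FIRST_TOKEN_S = 0.3   # fixed overhead before the first token arrives
TOKENS_PER_SECOND = 50.0      # steady-state decode speed

def cost(call):
    """Token-billed cost: linear in input and output token counts."""
    return (call.input_tokens / 1000 * PRICE_PER_1K_INPUT
            + call.output_tokens / 1000 * PRICE_PER_1K_OUTPUT)

def latency_s(call):
    """Rough latency: fixed overhead plus one decode step per output token."""
    return TIME_TO_FIRST_TOKEN_S + call.output_tokens / TOKENS_PER_SECOND

call = CallProfile(input_tokens=2000, output_tokens=500)
per_call = cost(call)
monthly = per_call * 1_000_000   # assumed one million such calls per month
print(f"cost per call: {per_call:.4f}")
print(f"latency: {latency_s(call):.1f} s")
print(f"monthly cost: {monthly:.0f}")
```

Under these assumed rates, output length dominates latency and both token counts drive cost, which is why trimming prompts and capping output length are usually the first levers to pull.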

How it works

The objective: model the data, then sample

A generative model is trained so that the probability it assigns to real examples is high. Once it is trained, you sample from it, and how you sample depends on the model family:

Autoregressive (GPT-style language models): factorize p(x) = ∏ p(xₜ | x_<t). Sampling means generating one token at a time, each token conditioned on everything previous.
Latent-variable / iterative-refinement (diffusion models for images, audio, video): start from noise and run a learned denoising process for N steps. Each step nudges the sample closer to the data distribution.
Encoder-decoder / sequence-to-sequence: encode input into a latent representation, decode into the target sequence. Still autoregressive on the decoder side; the encoder lets you condition on long inputs cleanly (translation, summarization).
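The autoregressive case can be sketched with a character-level bigram model. This is a deliberately tiny illustration: the corpus is invented, and a bigram model conditions only on the single previous character rather than on the full prefix x_<t that a transformer would see. The sampling loop has the same shape, though: produce one token, append it, condition on it, repeat.

```python
import random
from collections import Counter, defaultdict

# Invented toy corpus; it ends with a space so every character has a successor.
corpus = "the cat sat on the mat. the cat ate. "

# "Training": count how often each character follows each other character.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_char_distribution(prev):
    """Return p(x_t | x_{t-1}) as a dict of probabilities."""
    total = sum(counts[prev].values())
    return {ch: n / total for ch, n in counts[prev].items()}

def sample_sequence(start, length, rng):
    """Generate one character at a time, each conditioned on the previous one."""
    out = [start]
    for _ in range(length):
        dist = next_char_distribution(out[-1])
        chars, probs = zip(*dist.items())
        out.append(rng.choices(chars, weights=probs, k=1)[0])
    return "".join(out)

text = sample_sequence("t", 30, random.Random(0))
print(repr(text))
```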

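The objective during training is almost always some form of likelihood: maximize the probability of seen data, which for autoregressive models is the same as minimizing cross-entropy. Diffusion models minimize a noise-matching loss, which corresponds to a reweighted variational bound on the likelihood rather than the exact likelihood.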
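The link between likelihood and cross-entropy for the autoregressive case can be checked numerically. In this small sketch the per-token probabilities are invented; `cross_entropy` is just the average negative log of the probability the model assigned to each token that actually occurred.

```python
import math

def cross_entropy(probs_of_true_tokens):
    """Average negative log-likelihood (in nats) over the observed tokens."""
    n = len(probs_of_true_tokens)
    return -sum(math.log(p) for p in probs_of_true_tokens) / n

# Probability each model assigned to the token that actually came next (invented).
confident_model = [0.9, 0.8, 0.95]
poor_model = [0.2, 0.1, 0.3]

ce_good = cross_entropy(confident_model)
ce_bad = cross_entropy(poor_model)

# Sequence likelihood is the product of per-token probabilities,
# which equals exp(-N * cross_entropy).
likelihood_good = math.exp(-len(confident_model) * ce_good)

print(f"cross-entropy (confident): {ce_good:.3f}")
print(f"cross-entropy (poor):      {ce_bad:.3f}")
print(f"sequence likelihood:       {likelihood_good:.3f}")
```

Lower cross-entropy and higher sequence likelihood are the same statement, which is why training code minimizes one to maximize the other.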

Pretraining unlocked the scale axis

The earlier pattern was to train one model per task. The new pattern is to pretrain one foundation model on a huge unlabelled corpus, then adapt it (fine-tune, prompt, or condition it) for downstream tasks. The pretraining objective is usually simple: next-token prediction (language), masked reconstruction (BERT), or noise prediction (diffusion). Generality emerges from the volume and variety of the data, not from objective cleverness.
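Why these objectives need no labels can be sketched by building training pairs directly from raw text. The helper names below are illustrative, whitespace splitting stands in for a real tokenizer, and the `[MASK]` placeholder string is an assumption of this sketch. In both cases the targets come from the text itself.

```python
def next_token_pairs(tokens):
    """Next-token prediction: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_pairs(tokens, mask_positions, mask="[MASK]"):
    """Masked reconstruction: hide chosen positions, predict the originals."""
    corrupted = [mask if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in mask_positions}
    return corrupted, targets

# Whitespace splitting is a simplification of real tokenization.
tokens = "the cat sat on the mat".split()

pairs = next_token_pairs(tokens)
for context, target in pairs[:3]:
    print(context, "->", target)

corrupted, targets = masked_pairs(tokens, {1, 5})
print(corrupted, targets)
```

A six-token sentence yields five next-token examples for free, which is the mechanism that lets an unlabelled corpus act as its own supervision.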

Tokens, embeddings, and the shared interface

A modern generative system has a few shared substrate pieces regardless of modality:

A tokenizer that maps raw inputs (bytes, audio frames, image patches) to discrete units, usually integer IDs; some vision encoders skip the IDs and project patches straight to vectors.
An embedding table that lifts those IDs into a continuous vector space.
A stack of transformer-style blocks that mix information across tokens via attention and across features via feed-forward layers.
A head that converts the final hidden state back to the output space: a softmax over the vocabulary for text, a continuous noise or velocity prediction for diffusion, a categorical distribution over codebook entries for many audio models.
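The four pieces above can be wired together in a few dozen lines. This is a minimal sketch under heavy simplifying assumptions: weights are random rather than trained, the tokenizer is raw UTF-8 bytes, attention has no learned query/key/value projections or layer normalization, and there is a single block. All names and sizes are made up. The output is therefore meaningless as a prediction, but the data flow (IDs, vectors, attention mixing, feed-forward, softmax head) is the real one.

```python
import math
import random

rng = random.Random(0)
VOCAB = 256   # byte-level vocabulary
DIM = 8       # embedding width, tiny for illustration

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(vec, mat):
    """Row vector (len rows) times matrix (rows x cols) -> vector (len cols)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

embedding = rand_matrix(VOCAB, DIM)   # embedding table: ID -> vector
w_ff = rand_matrix(DIM, DIM)          # one feed-forward weight matrix
w_head = rand_matrix(DIM, VOCAB)      # output head: hidden state -> logits

def tokenize(text):
    """Tokenizer: raw text -> integer IDs (here, UTF-8 byte values)."""
    return list(text.encode("utf-8"))

def causal_attention(xs):
    """Mix across tokens: each position attends to itself and earlier positions."""
    out = []
    for t, q in enumerate(xs):
        past = xs[: t + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(DIM) for k in past]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, past)) for d in range(DIM)])
    return out

def block(xs):
    """One transformer-style block: attention, then feed-forward, with residuals."""
    attended = causal_attention(xs)
    mixed = [[a + b for a, b in zip(x, m)] for x, m in zip(xs, attended)]
    ff = [[max(0.0, h) for h in matvec(x, w_ff)] for x in mixed]   # ReLU
    return [[a + b for a, b in zip(x, f)] for x, f in zip(mixed, ff)]

def next_token_probs(text):
    ids = tokenize(text)
    hidden = block([embedding[i] for i in ids])
    return softmax(matvec(hidden[-1], w_head))   # distribution over next byte

probs = next_token_probs("hello")
print(len(probs), round(sum(probs), 6))
```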

The substrate is genuinely shared across modalities now, which is what makes multimodal models — image-and-text in, text-and-image out — buildable at all.

Variants and trade-offs

Autoregressive language models: strong reasoning and instruction-following; great at sequential structure (code, prose, math). Decoding cost grows at least linearly with output length, because tokens are produced one at a time and sampling cannot easily be parallelized within a sequence.

Diffusion models: strong on perceptual fidelity (images, audio, video); naturally bidirectional and easy to condition with classifier-free guidance. Sampling cost is proportional to the number of denoising steps, which is tunable at the expense of sample quality.
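The step-count trade-off can be illustrated with a toy iterative sampler. To be clear about the substitution: this is unadjusted Langevin sampling with an analytically known score for a one-dimensional Gaussian, not DDPM and not a trained network. It stands in for a learned denoiser only to show the shape of the loop: start from noise, apply N small corrections, and pay compute in proportion to N. The target parameters and step size are arbitrary choices.

```python
import math
import random

# The "data distribution" is N(MU, SIGMA^2), chosen arbitrarily for illustration.
MU, SIGMA = 3.0, 0.5

def score(x):
    """Gradient of the log-density of N(MU, SIGMA^2).
    In a real model a trained network would approximate this."""
    return (MU - x) / SIGMA**2

def sample(num_steps, rng, step_size=0.01):
    """Start from noise and take num_steps Langevin updates toward the data."""
    x = rng.gauss(0.0, 1.0)
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * noise
    return x

def mean_of_samples(num_steps, n=500, seed=0):
    rng = random.Random(seed)
    return sum(sample(num_steps, rng) for _ in range(n)) / n

few = mean_of_samples(5)      # cheap, but samples are still close to the noise
many = mean_of_samples(300)   # 60x the compute, samples centred near MU
print(f"5 steps:   mean = {few:.2f}")
print(f"300 steps: mean = {many:.2f}")
```

With too few steps the samples have not yet moved from the starting noise toward the target, which is the toy analogue of an under-stepped diffusion sample looking unfinished.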

Other axes worth knowing:

Open-weights vs. proprietary API. Open-weights models (Llama-family, Mistral, Qwen, DeepSeek) are ones you host yourself: predictable cost and full data control, but slower access to frontier capability. Proprietary APIs from the major frontier labs give you the latest capability with no serving infrastructure to run, but token pricing and rate limits are someone else’s policy and can change with model versions.
Dense vs. mixture-of-experts (MoE). Dense models activate every parameter on every token. MoE models route each token to a subset of “expert” sub-networks, decoupling parameter count from per-token compute. The trade-off is memory: an MoE needs all experts resident in accelerator memory even though only some run for each token.
Generalist vs. specialist. A frontier generalist model is expensive and capable across nearly any task. A specialist (small fine-tuned model, distilled student) is cheap and capable on one task. Most production stacks use both — generalist for hard or novel work, specialist for high-volume, narrow paths.
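The dense-versus-MoE trade-off in the list above comes down to simple arithmetic, sketched here. The configuration is hypothetical (it does not describe any released model), and the sketch treats experts as whole units and assumes 2 bytes per parameter, ignoring activations and the KV cache.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    shared_params: int        # attention, embeddings, etc., used by every token
    params_per_expert: int
    num_experts: int
    experts_per_token: int    # how many experts the router picks per token

    def total_params(self):
        """Everything that must be resident in memory."""
        return self.shared_params + self.params_per_expert * self.num_experts

    def active_params(self):
        """What actually runs for a single token."""
        return self.shared_params + self.params_per_expert * self.experts_per_token

def weight_memory_gb(params, bytes_per_param=2):
    """Weight memory only, assuming 16-bit weights by default."""
    return params * bytes_per_param / 1e9

# Hypothetical configuration for illustration.
cfg = MoEConfig(
    shared_params=2_000_000_000,
    params_per_expert=5_000_000_000,
    num_experts=8,
    experts_per_token=2,
)

print(f"total params:  {cfg.total_params() / 1e9:.0f}B")
print(f"active params: {cfg.active_params() / 1e9:.0f}B per token")
print(f"weight memory: {weight_memory_gb(cfg.total_params()):.0f} GB")
```

In this made-up configuration, per-token compute looks like a 12B dense model while the memory footprint looks like a 42B one, which is the decoupling and the cost the list item describes.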

A short history people skip

Generative models existed before transformers: Markov chains for text (1948 onward), Boltzmann machines (1985), RNN language models (2010s), VAEs (2013), GANs (2014). What was missing was a model that could scale stably. RNNs hit gradient and memory walls; GANs were notoriously hard to train; VAEs tended to produce blurry samples. The 2017 transformer broke the scaling barrier for language by replacing sequential recurrence with parallel attention, and the 2020 diffusion revival (DDPM) did the same for images by reframing generation as iterative denoising. The architectures matter precisely because they made scale tractable.

When this is asked in interviews

This is the warm-up question on AI-product loops, ML engineering loops, and increasingly on senior backend / platform loops where the team is integrating a foundation-model API.

The interviewer is checking three things in roughly this order:

Whether you know what makes a model “generative” technically (modeling p(x) and sampling, not just labeling).
Whether you know the engineering shape (latency, token pricing, evaluation harness, hallucinations), not just the marketing.
Whether you can place a system you’ve built on the right side of the build-vs-buy line: when to call a hosted API, when to fine-tune, when to self-host.

Common follow-ups:

“When wouldn’t you use a generative model?” — classification tasks with clean labels, anything latency-bound under ~50ms, anything where the user must not see hallucinated content and you have no validation step.
“What’s the difference between fine-tuning and prompting?” — fine-tuning changes the weights and survives across calls; prompting changes the context window and applies to one call.
“What broke between traditional NLP and foundation-model NLP?” — task-specific architectures, hand-engineered features, supervised label collection as the bottleneck. All replaced by one general pretrained model and an evaluation set.