Evaluating Large Language Models

Summary

Evaluating an LLM is harder than evaluating any model class that came before it. The output space is open-ended (free-form text), the correct answer is often one of many, and the things you actually care about — helpfulness, honesty, factuality, safety, style — don’t reduce to a single number. The result is a stack of imperfect signals: perplexity for raw modelling quality, closed-form benchmarks (MMLU, HumanEval, GSM8K, MATH) for capability, preference evaluations (Chatbot Arena, MT-Bench, AlpacaEval) for human-preferred behaviour, and LLM-as-judge for cheap, scalable preference scoring on your own data.

Every benchmark is wrong in some way — saturated, contaminated, gameable, misaligned with what users care about — and you still need them. The discipline is to triangulate across many evals, build domain-specific evals for your own workloads, and treat any single metric as suspect.

Why it matters

Eval is where most teams underinvest and pay for it later. Without a good evaluation harness, every model change is a guess: did the new fine-tune help or hurt? Did upgrading from one base model to another improve your product? Did the prompt change you made yesterday regress something? You won’t know.

For frontier labs, eval is central to research direction — they’re chasing benchmark numbers because that’s how progress is published and how the field communicates. For product teams, eval is the only way to know that your specific workload — RAG over your docs, your customer-support flow, your code-completion product — is actually better. Public benchmarks correlate with private quality but only loosely. Building your own evals is the most leveraged engineering work on most AI products.

How it works

Perplexity: the loss you actually trained on

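Perplexity is exp(cross-entropy loss) on held-out text, that is, the exponential of the average per-token negative log-likelihood. It measures how surprised the model is by real data. Lower is better. It’s the natural metric for a pretrained language model because it is a direct transform of the next-token loss the model was trained to minimize, and it tracks most downstream capabilities reasonably well, though not perfectly. Perplexities are only comparable between models that share a tokenizer and are scored on the same text.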

Perplexity’s weakness is that it doesn’t capture what you care about once the model is good enough to be useful. A model with perplexity 4.0 isn’t twice as good at math as one with perplexity 8.0; the perplexity-to-utility curve is nonlinear and task-dependent. Frontier labs report perplexity in their papers; product teams almost never use it as a primary signal.
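As a minimal sketch of the definition above, the function below turns per-token log-probabilities into a perplexity. The function name and the assumption that you already have natural-log probabilities for each held-out token are illustrative; how you obtain them depends on your model API.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities of held-out text."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Cross-entropy is the mean negative log-likelihood per token.
    cross_entropy = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(cross_entropy)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is (approximately) 4.
uniform_logprobs = [math.log(0.25)] * 8
print(perplexity(uniform_logprobs))
```

Read this way, perplexity is an "effective branching factor": halving it means the model is, on average, choosing among half as many plausible next tokens, which says little about whether it can solve a given math problem.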

Closed-form benchmarks: capability snapshots

These are large datasets of questions with known answers, scored by exact match, multiple-choice accuracy, or test-case pass rate:

MMLU (Massive Multitask Language Understanding): 57 subjects, multiple-choice. The de facto general-knowledge benchmark; frontier models are at >90%, near the ceiling.
HumanEval / MBPP: code-completion problems, scored by running test cases. The standard for function-level coding capability; frontier models pass ~95% of HumanEval. Successor benchmarks like SWE-bench Verified evaluate real-world software engineering.
GSM8K / MATH: grade-school and competition math, exact-match scored. Frontier models have nearly saturated GSM8K; MATH (competition-level problems drawn from contests such as AMC and AIME) is harder, and reasoning-tuned models dominate it.
BBH (Big-Bench Hard): 23 challenging subtasks pulled from the larger BIG-bench, selected because earlier models trailed average human raters on them.
HellaSwag, ARC, TruthfulQA, WinoGrande: older but still cited; mostly saturated by frontier models.
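The scoring rules for these benchmarks are simple enough to sketch. The block below shows multiple-choice accuracy and the unbiased pass@k estimator commonly reported for HumanEval-style code benchmarks (the probability that at least one of k sampled solutions passes, estimated from n samples of which c pass). The function names and the letter-normalisation step are assumptions for illustration, not any benchmark's official harness.

```python
from math import comb


def accuracy(predictions, answers):
    """Fraction of items where the predicted choice matches the gold choice."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and aligned")
    # Normalise case and whitespace so "b " and "B" count as the same choice.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples generated, c of them pass all test cases."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(accuracy(["A", "b ", "C"], ["A", "B", "D"]))
print(pass_at_k(n=10, c=3, k=1))
```

With k = 1 the estimator reduces to c / n, the plain fraction of samples that pass; larger k rewards models that solve a problem at least occasionally.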

The pattern: each benchmark saturates within a few years of release, and newer, harder ones (GPQA-Diamond, ARC-AGI, FrontierMath, SWE-bench Verified) take its place. By 2026, benchmarks are saturating faster than replacements are being created.

Preference evaluations: what humans actually prefer

Capability benchmarks ask “can the model do it?” Preference evals ask “does the user like the output?” The latter are the metrics that matter most for chat products.

Chatbot Arena (LMSYS): pairwise human comparisons, Elo-style ratings, public leaderboard. The gold standard for “humans prefer this model”, but slow and expensive to refresh.
MT-Bench, AlpacaEval, Arena-Hard: automated preference evals using a strong LLM as judge. Fast, cheap, and reasonably well correlated with human preferences.
Helpfulness, harmlessness, honesty (HHH) ratings: Anthropic’s framework, expanded in practice to many product-specific axes (conciseness, format-following, refusal-appropriateness).
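To make the pairwise-comparison idea concrete, here is a sketch of the textbook online Elo update applied to a single "A beat B" vote. The K-factor of 32 and the starting rating of 1000 are conventional chess-style defaults chosen for illustration; public leaderboards may fit ratings with a different statistical procedure.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, winner, loser, k=32.0):
    """Update ratings in place after one pairwise comparison."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    # The winner gains exactly what the loser gives up, so the total is conserved.
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, winner="model_a", loser="model_b")
print(ratings)
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result, which is why a few hundred votes per pair are usually needed before rankings stabilise.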

Preference evals are themselves biased — toward longer answers, toward styles the judge model prefers, toward sycophantic agreement. Good preference eval design controls for these (length-normalisation, multi-judge ensembles, explicit anti-sycophancy prompts).

LLM-as-judge: the cheap eval that quietly took over

Running a strong LLM (GPT-4-class or Claude-3-class) as a judge over your own outputs has become the default eval pattern for product teams. You give the judge a rubric and two outputs (A vs B, or output vs reference) and ask which is better and why. It is commonly cited as roughly 100x cheaper than human judges, and published studies such as the MT-Bench work report strong judges agreeing with human raters a little over 80% of the time, about as often as human raters agree with each other; expect that figure to vary by task. The failure modes are real: judge models have stylistic preferences, length bias, position bias (the first answer wins more often), and self-preference (a judge model prefers outputs from its own family). The fixes are well known: randomise position, use multiple judges, define narrow rubrics. For most product evals in 2026, LLM-as-judge is the right tool, ideally calibrated against human ratings on a small reference set.
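A minimal sketch of the position-bias fix and the calibration step described above. The `judge` argument is a hypothetical callable (in practice a wrapper around whatever LLM API you use) that returns "first" or "second"; the two toy judges exist only so the example runs without a model. Asking in both orders is a deterministic variant of randomising position.

```python
def judge_pair(judge, prompt, out_a, out_b):
    """Ask the judge in both orders; only a consistent verdict counts as a win."""
    verdict_ab = judge(prompt, out_a, out_b)  # A shown first
    verdict_ba = judge(prompt, out_b, out_a)  # B shown first
    a_votes = (verdict_ab == "first") + (verdict_ba == "second")
    b_votes = (verdict_ab == "second") + (verdict_ba == "first")
    if a_votes == 2:
        return "A"
    if b_votes == 2:
        return "B"
    return "tie"  # the verdict flipped with position, so treat it as unreliable


def agreement(judge_labels, human_labels):
    """Fraction of items where the judge's label matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("labels must be non-empty and aligned")
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)


# Toy stand-ins for a real judge model (assumptions for illustration only).
def always_first(prompt, first, second):
    return "first"  # pure position bias


def prefers_longer(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"  # pure length bias


print(judge_pair(always_first, "q", "x", "y"))
print(judge_pair(prefers_longer, "q", "a long answer", "short"))
```

The purely position-biased judge can never produce a win under this scheme, which is the point; the length-biased judge still can, so length bias needs its own control.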

Holistic and capability-specific evals

Beyond the standard benchmark suites, mature eval stacks include:

Domain-specific evals — your RAG pipeline on your docs, your customer-support agent on past tickets, your code completion on your codebase. The most important kind.
Safety / red-team evals — prompts designed to elicit harmful, biased, or jailbroken outputs. Run as part of release gating.
Robustness evals — prompt rephrasings, adversarial inputs, distribution shift. A model that scores 95% on MMLU but drops to 60% with minor rephrasings is fragile.
Long-context evals — needle-in-haystack tests (find a fact at position N of a 100K-token context), RULER, LongBench. Most models degrade significantly past 32K-64K real-world context, despite advertised much-longer windows.
Tool-use and agentic evals — SWE-Bench, WebArena, AgentBench. Increasingly load-bearing as products move from chat to agents.
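As one example from the list above, a needle-in-haystack test can be sketched in a few lines: place a known fact at a chosen relative depth in filler text, ask for it, and record whether the reply contains it. The `model` argument is a hypothetical callable from prompt to reply; `truncating_model` is a toy stand-in that only "sees" the first 500 characters, used here to imitate long-context degradation.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a relative depth in [0, 1] among filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler_sentences))
    return filler_sentences[:index] + [needle] + filler_sentences[index:]


def needle_recall(model, filler_sentences, needle, question, answer, depths):
    """Map each depth to whether the model's reply contains the expected answer."""
    results = {}
    for depth in depths:
        context = " ".join(build_haystack(filler_sentences, needle, depth))
        reply = model(context + "\n\n" + question)
        results[depth] = answer.lower() in reply.lower()
    return results


def truncating_model(prompt, window=500):
    """Toy model (assumption): answers only if the fact is in the first `window` chars."""
    return "7421" if "7421" in prompt[:window] else "I don't know"


filler = ["The sky is blue."] * 100
print(needle_recall(truncating_model, filler, "The secret code is 7421.",
                    "What is the secret code?", "7421", [0.0, 0.5, 1.0]))
```

Sweeping both depth and total context length gives the familiar heat-map view of where retrieval starts to fail.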

Eval-driven development

The mature workflow: keep a frozen eval set, run it before and after every model or prompt change, track regression by category. Treat eval failures the same way you treat unit-test failures. Frontier labs do this with massive eval harnesses (>1000 benchmarks); product teams should do a scaled-down version of the same.
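The workflow above can be sketched as a small regression gate: compute a pass rate per category on the frozen set before and after a change, then flag any category that dropped by more than a threshold. The data shapes, function names, and the 2-point default threshold are assumptions for illustration.

```python
def category_scores(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals, passes = {}, {}
    for category, passed in results:
        totals[category] = totals.get(category, 0) + 1
        passes[category] = passes.get(category, 0) + int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def regressions(baseline, candidate, max_drop=0.02):
    """Categories whose pass rate fell by more than max_drop (absolute)."""
    flagged = {}
    for category, before in baseline.items():
        # A category missing from the candidate run counts as a score of 0.
        after = candidate.get(category, 0.0)
        if before - after > max_drop:
            flagged[category] = (before, after)
    return flagged


baseline = {"math": 0.90, "code": 0.80}
candidate = {"math": 0.91, "code": 0.70}
print(regressions(baseline, candidate))
```

Wired into CI, a non-empty result fails the build in the same way a failing unit test would.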

Variants and trade-offs

Closed-form benchmarks — fast, cheap, reproducible, comparable across models. Saturate quickly, can be contaminated (model trained on the test set), don’t capture open-ended quality.

Preference and LLM-as-judge evals — capture open-ended quality, scale to product-specific data, track what users actually care about. Subject to judge bias, harder to reproduce across model families, more expensive than closed-form.

Other axes:

Eval contamination. Public benchmarks leak into training data, so a model may be recalling memorised answers rather than reasoning about them. Frontier labs disclose decontamination efforts; benchmark authors respond by keeping test sets private (FrontierMath) or by embedding canary strings and restricting plain-text distribution (GPQA-Diamond) so the data can be filtered out of training corpora.
Length bias. Many evals (both human and LLM-as-judge) prefer longer answers, often unjustifiably. Length-controlled evaluations (AlpacaEval-LC) correct for this.
Position bias. When comparing two outputs side by side, the first one tends to win more often. Fix: randomise order and average.
Self-preference. GPT-as-judge prefers GPT outputs; Claude-as-judge prefers Claude outputs. Fix: judge with a different family, or use multiple judges and average.
Static vs adaptive evals. Static evals are fixed datasets; adaptive evals (Dynabench-style) generate hard examples on the fly. Adaptive evals fight saturation but are harder to compare longitudinally.
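For the contamination axis above, one common heuristic is n-gram overlap: flag any eval item that shares a long word sequence with the training corpus. The sketch below assumes whitespace tokenisation, an 8-gram window, and a corpus small enough to hold in memory; all three are simplifications, and real decontamination pipelines differ in their details.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (empty if text is shorter than n)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(eval_items, corpus_docs, n=8):
    """Return (fraction flagged, flagged items) based on shared n-grams."""
    if not eval_items:
        raise ValueError("need at least one eval item")
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    # An item is flagged if any of its n-grams also occurs in the corpus.
    flagged = [item for item in eval_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(eval_items), flagged


corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
items = [
    "Q: the quick brown fox jumps over the lazy dog near what?",
    "what is two plus two",
]
rate, flagged = contamination_rate(items, corpus)
print(rate, flagged)
```

Exact overlap misses paraphrased or translated leaks, so a clean result here lowers suspicion without ruling contamination out.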

The benchmarks-saturate-faster-than-they're-made problem

In 2018, BERT smashed GLUE; by 2020 GLUE was effectively retired and SuperGLUE was the bar. By 2022, SuperGLUE was saturated. When MMLU was released in 2020, most models scored near the 25% random-chance baseline and the largest GPT-3 reached only about 44%; by 2024 frontier models were at >90%. HumanEval was below 30% in 2021 and is now ~95%. The pattern repeats: a benchmark becomes the standard, models saturate it within 2–4 years, a new benchmark replaces it. By 2026, the frontier is in benchmarks that didn’t exist 18 months earlier or have only recently become standard (FrontierMath, SWE-bench Verified, ARC-AGI). The practical implication: if your eval suite is more than a year old, half of it probably isn’t measuring anything meaningful at the top of the leaderboard anymore.

When this is asked in interviews

This is a frequent mid-to-senior question on AI-engineering and AI-product loops, particularly when the team has had to defend or attack model choices. The interviewer wants to see eval as a system, not as a single benchmark.

What they’re checking:

Can you name the main public benchmarks and what each measures (MMLU = general knowledge, HumanEval = code, MATH/GSM8K = math, Chatbot Arena = preference)?
Do you understand the failure modes: saturation, contamination, length bias, judge bias?
Can you design a product-specific eval set from a description of the product, including offline evals (frozen test set) and online evals (production traffic monitoring)?

Common follow-ups:

“Why don’t you trust public benchmarks?” Saturation, contamination, narrow coverage, misalignment with your actual workload. Public benchmarks tell you which model class is in the right ballpark; private evals tell you which one to ship.
“How would you evaluate a RAG system?” Separate retrieval eval (recall@k on a set of query-doc pairs) from generation eval (LLM-as-judge or human ratings on whether the response is grounded in the retrieved context), then score the whole pipeline end-to-end on a fixed eval set.
“How do you catch regressions when you change a prompt?” Keep a frozen eval set, run it before and after, and gate on no category regressing more than X%. Treat prompts like code.
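The retrieval half of the RAG answer above is easy to make concrete. The sketch below computes recall@k for one query and averages it over a set of query-doc pairs; the document-id strings and function names are hypothetical, and the retriever itself is assumed to exist elsewhere.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


runs = [
    (["d3", "d1", "d7"], ["d1", "d2"]),  # one of two relevant docs in the top 2
    (["d5", "d9", "d4"], ["d5"]),        # the only relevant doc is ranked first
]
print(mean_recall_at_k(runs, k=2))
```

Scoring retrieval separately tells you whether a bad answer came from missing context or from the generator ignoring context it was given.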