Reflection and Self-Critique — Agentic

What it is

Reflection is the pattern where an agent — or a second model acting on its behalf — examines its own output before committing it, identifies problems, and either revises the output or retries the upstream step. It is a feedback loop inserted between producing and acting, and it is one of the most impactful interventions you can add to a vanilla ReAct agent.

The motivation is simple: language models produce plausible-looking output, but plausibility is not correctness. A first-pass answer often has subtle defects — a hallucinated citation, an off-by-one in code, a missed constraint from the prompt, an inconsistent style. Asking the same model “is this actually good?” with the right framing catches a meaningful fraction of those defects. Asking a different model the same question catches more. Asking a non-LLM verifier (a unit test, a schema check, a regex) catches more still. The reflection pattern composes these into a structured second look.

Reflection is sometimes called self-refinement, self-correction, critic-actor, or judge-and-revise, depending on the literature. The shape is consistent: produce, critique, decide, optionally revise.

When to use it

Reflection earns its compute when at least one of the following holds:

There is an external verifier you can lean on. Tests pass / fail, schema matches / doesn’t, the SQL query parses or errors, the generated image matches the prompt. The verifier doesn’t have to be perfect — it just has to be sharper than the producing model’s self-confidence.
Errors are expensive enough to justify a second pass. A medical-summary mistake, a financial-report typo, a code change that breaks production. The cost of one bad output exceeds the cost of N reflection rounds.
The producing model is operating near its capability frontier. Reflection helps most when the producer is making mistakes a smarter critic could spot; if the producer is solidly within its competence, reflection is mostly waste.
The output has structural requirements the producer often misses. Word counts, JSON schemas, tone constraints, “must mention X and not Y”. A specialised critic enforces these reliably.

Don’t reflect when:

The task is short and well within the producer’s capability. A summarisation of a paragraph, a single-fact lookup, a simple transformation — adding reflection just doubles the cost.
Latency is critical and outputs are streaming to a user. Reflection adds a round trip; in a real-time chat surface, that’s perceptible lag.
The producing model is already the best critic you have. Self-reflection with no external grounding has bounded returns — the model that wrote the answer is the model that thought it was right.

How it works

The four-block reflection loop

A clean reflection step looks like this:

Produce. The producer (often the main agent) emits a candidate output.
Critique. The critic — same model with a reviewer prompt, a different model, or a non-LLM verifier — reads the output and emits a structured judgment: pass/fail, defects, suggested edits.
Decide. A small controller decides what to do: accept, revise, retry, escalate. This can be deterministic (any “fail” triggers revision) or itself model-driven.
Revise. If revision is chosen, the producer is re-prompted with the critic’s feedback and emits a new candidate. The loop optionally repeats up to a budget.

The decisions encoded in the controller — accept-on-pass, revise-on-fail, retry-on-error, escalate-after-N — are where most of the engineering effort goes. The producer and critic prompts are the visible part; the controller is the load-bearing part.
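The four blocks and the controller policies above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference implementation: `Critique`, `reflect`, `stub_produce`, and `stub_critique` are hypothetical names, and the stubs stand in for real model or verifier calls. Retry-on-error handling is left out to keep the control flow visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    verdict: str                                  # "pass" or "needs_revision"
    issues: List[str] = field(default_factory=list)


def reflect(
    produce: Callable[[str, Optional[Critique]], str],
    critique: Callable[[str], Critique],
    task: str,
    max_rounds: int = 1,
) -> Tuple[str, str]:
    """Run produce -> critique -> decide -> revise under a round budget.

    Returns (output, status), where status is "accepted" or "escalated".
    """
    candidate = produce(task, None)               # Produce
    for _ in range(max_rounds):
        result = critique(candidate)              # Critique
        if result.verdict == "pass":              # Decide: accept-on-pass
            return candidate, "accepted"
        candidate = produce(task, result)         # Revise: revise-on-fail
    # Budget spent: check the last revision once, then escalate-after-N.
    if critique(candidate).verdict == "pass":
        return candidate, "accepted"
    return candidate, "escalated"


# Hypothetical stubs standing in for model calls.
def stub_produce(task: str, feedback: Optional[Critique]) -> str:
    return "short answer" if feedback else "a far too long first answer"


def stub_critique(text: str) -> Critique:
    if len(text.split()) > 3:
        return Critique("needs_revision", ["exceeds the 3-word budget"])
    return Critique("pass")


output, status = reflect(stub_produce, stub_critique, "summarise", max_rounds=1)
print(output, status)
```

Note that the producer and critic are plain callables; all of the policy (budget, accept, escalate) lives in `reflect`, which is the controller in this sketch.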

Who is the critic?

Three flavours, each with characteristic costs and benefits:

Same model, different role (self-critique). The cheapest LLM-based option. The same model that wrote the answer is asked, “Here’s the answer. Are there any problems with it? List them.” Works surprisingly well, partly because critiquing existing text is an easier mode for the model than generating fresh text.
Different model (judge model). A second model — often a stronger or differently-trained one — reviews the output. Catches errors the producer is systematically blind to. Costs more per round but raises the ceiling.
Non-LLM verifier. Unit tests, schema validators, linters, fact-checkers, simulators. Cheapest per round (no token cost), sharpest signal (deterministic pass/fail). Usable wherever the task has a verifiable property.

A common production shape combines all three: cheap LLM self-critique on every output, deterministic verifier checks where they apply, and a stronger judge model called only when the cheap checks flag a problem.
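One way to wire the three critics into a cascade is sketched below. It assumes each critic can be wrapped as a function that returns a problem description or `None`; `Check`, `cascade`, `json_verifier`, `stub_self_critic`, and `stub_judge` are hypothetical names, and the two LLM critics are stubbed so the example runs without any model.

```python
import json
from typing import Callable, List, Optional

# A check returns a description of the problem it found, or None if it found none.
Check = Callable[[str], Optional[str]]


def cascade(output: str, verifiers: List[Check], self_critic: Check, judge: Check) -> List[str]:
    """Run the cheap checks on every output; call the judge only if they flag something."""
    issues: List[str] = []
    for check in list(verifiers) + [self_critic]:
        problem = check(output)
        if problem is not None:
            issues.append(problem)
    if not issues:
        return issues                    # cheap checks are clean: skip the judge
    second_opinion = judge(output)       # expensive call, flagged outputs only
    if second_opinion is not None:
        issues.append(second_opinion)
    return issues


def json_verifier(output: str) -> Optional[str]:
    """Deterministic verifier: does the output parse as JSON?"""
    try:
        json.loads(output)
        return None
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"


def stub_self_critic(output: str) -> Optional[str]:
    return None                          # stand-in for a cheap LLM self-critique


judge_calls: List[str] = []


def stub_judge(output: str) -> Optional[str]:
    judge_calls.append(output)           # record how often the expensive critic runs
    return "judge: output cannot be consumed downstream"


clean = cascade('{"a": 1}', [json_verifier], stub_self_critic, stub_judge)
flagged = cascade('{"a": ', [json_verifier], stub_self_critic, stub_judge)
print(clean, len(flagged), len(judge_calls))
```

In this sketch the judge runs once across the two outputs, which is the cost property the cascade is meant to buy.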

The diminishing-returns curve

Empirically, reflection has a strong first-round effect and a sharp drop-off after that:

Round 1. Big quality lift, often 10–30% on benchmark tasks. The model catches its own obvious mistakes.
Round 2. Smaller lift, often 3–8%. The remaining errors are the ones the model can’t see.
Round 3+. Marginal to zero. The model starts revising things that didn’t need revision, sometimes introducing new errors.

The pattern: budget one reflection round by default, two for high-stakes outputs, three only when each round has a fresh signal (e.g., a new verifier kicks in, or a different critic is used). Beyond that, you are paying tokens for noise.

The shape changes when the critic is external — a unit-test runner, a real user response, an environment observation. Then the loop is grounded in something other than the model’s self-assessment, and longer loops can pay off. Eureka’s reward-design loop is the canonical example: many rounds of reflection are productive because each round is anchored to a fresh RL training run.

Structured vs unstructured critique

The critic’s output should be structured:

{
  "verdict": "needs_revision",
  "issues": [
    { "type": "hallucination", "location": "paragraph 2", "detail": "citation [3] does not exist" },
    { "type": "constraint_violation", "detail": "exceeds the 200-word budget by 40 words" }
  ],
  "suggested_edits": ["remove paragraph 2's citation; condense paragraphs 3 and 4 into one"]
}

Three things this gets you:

A controller can read it without parsing prose. The decide step reduces to a field check: if verdict is "needs_revision", revise.
The producer sees a clean issue list on revision. Easier to act on than a paragraph of critique.
Logs are queryable. “Show me all the hallucination-flagged outputs from yesterday” is a SQL query against the verdict JSON.

Free-form critique — “Here are some thoughts on your output” — is harder to act on and harder to analyse later.
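To make the benefits of the structured form concrete, here is a minimal sketch that parses the example critique above into typed records, routes on the verdict field, and filters issues by type. The class and function names (`Issue`, `Verdict`, `parse_critique`, `decide`) are illustrative assumptions, not part of any particular framework.

```python
import json
from dataclasses import dataclass
from typing import List

RAW = """
{
  "verdict": "needs_revision",
  "issues": [
    { "type": "hallucination", "location": "paragraph 2", "detail": "citation [3] does not exist" },
    { "type": "constraint_violation", "detail": "exceeds the 200-word budget by 40 words" }
  ],
  "suggested_edits": ["remove paragraph 2's citation; condense paragraphs 3 and 4 into one"]
}
"""


@dataclass
class Issue:
    type: str
    detail: str
    location: str = ""            # optional in the critic's output


@dataclass
class Verdict:
    verdict: str
    issues: List[Issue]
    suggested_edits: List[str]


def parse_critique(raw: str) -> Verdict:
    """Turn the critic's JSON into typed records the controller can route on."""
    data = json.loads(raw)
    return Verdict(
        verdict=data["verdict"],
        issues=[Issue(**item) for item in data.get("issues", [])],
        suggested_edits=data.get("suggested_edits", []),
    )


def decide(critique: Verdict) -> str:
    """Deterministic decide step: route on the verdict field only."""
    return "revise" if critique.verdict == "needs_revision" else "accept"


critique = parse_critique(RAW)
hallucinations = [issue for issue in critique.issues if issue.type == "hallucination"]
print(decide(critique), len(hallucinations))
```

The same records can be written to a log table as-is, which is what makes the "all hallucination-flagged outputs" query cheap later.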

Reflection on the loop, not just the output

Most discussions frame reflection as critiquing the final output. The more powerful framing is trajectory reflection: critique the whole sequence of actions and observations after the loop ends, identify what went wrong, and let the agent retry with that learning in context.

This is what Eureka does for reward design, what experience-replay-style agents do for repeated tasks, and what some web-agent training setups do during eval. The pattern:

Agent attempts a task. Records the full trajectory.
After completion (success or failure), a reflector reviews the trajectory and emits a structured retrospective.
The retrospective is added to the agent’s prompt (or memory) for the next attempt.
Over many attempts, the agent’s effective capability rises without any model retraining.

This is the bridge between “reflection as a per-output pattern” and “reflection as the basis for self-improving agents”. The full self-improving design is covered in Design a Self-Improving Web Agent.
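The attempt, reflect, retry cycle described above can be sketched in a few lines. This is an illustrative sketch, not how Eureka or any specific system implements it: `run_with_retrospectives`, `stub_agent`, and `stub_reflector` are hypothetical names, and as a simplification the sketch stores retrospectives only for failed attempts.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]                       # (action, observation)


def run_with_retrospectives(
    run_agent: Callable[[str, List[str]], Tuple[List[Step], bool]],
    reflector: Callable[[List[Step]], str],
    task: str,
    max_attempts: int = 3,
) -> Tuple[bool, int, List[str]]:
    """Retry a task, feeding each failed trajectory's retrospective into the next attempt."""
    lessons: List[str] = []                  # retrospectives carried across attempts
    for attempt in range(1, max_attempts + 1):
        trajectory, success = run_agent(task, lessons)
        if success:
            return True, attempt, lessons
        lessons.append(reflector(trajectory))
    return False, max_attempts, lessons


# Hypothetical stubs: the agent only succeeds once a lesson tells it to log in.
def stub_agent(task: str, lessons: List[str]) -> Tuple[List[Step], bool]:
    if any("log in" in lesson for lesson in lessons):
        return [("login", "ok"), ("checkout", "order placed")], True
    return [("checkout", "error: not authenticated")], False


def stub_reflector(trajectory: List[Step]) -> str:
    action, observation = trajectory[-1]
    if "not authenticated" in observation:
        return f"'{action}' failed because the session was anonymous; log in first"
    return "no clear cause found"


succeeded, attempts, lessons = run_with_retrospectives(stub_agent, stub_reflector, "buy a book")
print(succeeded, attempts, lessons)
```

No weights change between attempts; the only thing that improves is the list of lessons passed back into the agent.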

Variants

The base pattern composes in several shapes:

Self-refinement (same model, same context). The simplest: after producing, prompt “Critique this and revise.” One model, one trace, two passes. Default starting point.
Actor-critic. Two roles in the same model — one prompted as the producer, one as the reviewer. Same compute as self-refinement but with cleaner separation in logs.
Judge-model. A second, separate model reviews the producer’s output. Adds cost; raises ceiling.
Reflexion. A specific published pattern: after each task attempt, the agent generates a verbal reflection on what went wrong, which is appended to its memory for the next attempt. Episodic learning without weight updates.
Constitutional AI / RLAIF style reflection. The critic uses a fixed set of principles (“constitution”) to score the output. Useful when the critic’s standards must be controllable and consistent.
Best-of-N with judge. Producer generates N independent candidates; judge picks the best. Trades latency and cost for quality. Common in code-generation and creative-writing pipelines.
Tree-of-thoughts with backtracking. A more aggressive variant: the producer branches into several candidate paths; a critic evaluates each branch; weak branches are pruned. Compute-heavy; useful for problems where intermediate state can be objectively scored.
Reflection over a window, not per-output. The agent runs for K steps, then a reflector reviews the last K steps and decides whether the agent is making progress. Triggers a re-plan if not. The “stagnation” failure mode in ReAct loops is one this catches.
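As one example of how these variants reduce to small controllers, best-of-N with a judge can be sketched as below. The names `best_of_n` and `stub_judge` and the draft strings are hypothetical; in a real pipeline `produce` would sample a model N times and the judge would be a model or a verifier-derived score.

```python
from typing import Callable, List


def best_of_n(produce: Callable[[int], str], judge: Callable[[str], float], n: int) -> str:
    """Generate n independent candidates and return the one the judge scores highest."""
    candidates: List[str] = [produce(i) for i in range(n)]
    return max(candidates, key=judge)


# Hypothetical stubs: three canned drafts and a judge that prefers two-word answers.
DRAFTS = ["ok", "a much longer rambling answer", "concise answer"]


def stub_judge(text: str) -> float:
    return -abs(len(text.split()) - 2)       # 0 is the best possible score


winner = best_of_n(lambda i: DRAFTS[i], stub_judge, n=3)
print(winner)
```

Unlike the revise loop, nothing is fed back to the producer here: cost scales with N, and quality depends on how well the judge separates the candidates.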

When self-reflection is harmful — the over-correction failure mode

A failure mode worth knowing: a correct first answer that gets revised into a wrong one. The model, asked “are there problems with this answer?”, will often find some, even when there aren’t any. The cure is to make the critic’s output neutral by default: prompt it to say “no significant issues” when it sees none, and gate revision on the verdict field rather than on the presence of any text in the critique. Empirically, on tasks where the producer is already in its strength zone, naive self-reflection can hurt accuracy by ~5%. Tune the critic prompt, or skip reflection on outputs that already pass verifier checks.
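The two mitigations can be sketched as a small gate. The prompt wording and the names `CRITIC_PROMPT` and `should_revise` are assumptions for illustration; the point is only that revision is triggered by the verdict field, never by the mere presence of critique text.

```python
from typing import Optional

# Hypothetical reviewer prompt that makes "no issues" an explicit, acceptable answer.
CRITIC_PROMPT = (
    "Review the answer below. If you find no significant issues, reply with "
    'verdict "pass" and an empty issue list. Only use "needs_revision" for '
    "defects that would change the answer's correctness or violate a constraint."
)


def should_revise(critique: dict, verifier_passed: Optional[bool] = None) -> bool:
    """Gate revision on the verdict field, and skip it when a verifier already passed."""
    if verifier_passed is True:
        return False                                  # grounded pass beats LLM opinion
    return critique.get("verdict") == "needs_revision"


# A critic that passes the answer but still adds a stylistic remark.
chatty_pass = {"verdict": "pass", "issues": [], "notes": "Could be slightly more formal."}
real_defect = {"verdict": "needs_revision", "issues": ["citation [3] does not exist"]}

print(should_revise(chatty_pass))                     # remark alone does not trigger revision
print(should_revise(real_defect))
print(should_revise(real_defect, verifier_passed=True))
```

A naive gate such as "revise if the critique contains any text" would send `chatty_pass` back for revision, which is exactly the over-correction path described above.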

Example systems

NVIDIA Eureka — the canonical reflection-driven system in this workbook. Reward functions are generated, evaluated in RL training, and reflected upon; the reflection drives the next round of generation. Many rounds work because each is grounded in a fresh training-run result.
MACRS — uses user-feedback-aware reflection: the agent’s recommendation is critiqued in light of the user’s reaction, and the next-turn strategy adapts. The reflection is over conversation trajectory, not isolated outputs.
ChainBuddy — pipeline generation includes a self-check phase where the agent reviews the generated workflow for completeness against the gathered requirements.
Coding agents (Claude Code, Cursor, Aider). After applying an edit, run the tests; on failure, reflect on the test output and revise. The verifier is the test suite — strong, fast, deterministic. This is reflection-with-grounding and works well in this domain.
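The coding-agent case can be sketched with a test suite as the critic. This is a toy illustration and not how Claude Code, Cursor, or Aider are implemented: `run_tests`, `stub_coder`, and `edit_until_green` are hypothetical names, the "model" is a stub that fixes an off-by-one when it sees feedback, and `exec` is used only because the candidate code here is trusted; a real agent would run generated code in a sandbox.

```python
from typing import Callable, Optional, Tuple


def run_tests(source: str) -> Optional[str]:
    """Verifier: load the candidate and run a tiny test suite; return failure text or None."""
    namespace: dict = {}
    try:
        exec(source, namespace)                       # trusted toy input only
        assert namespace["sum_to"](3) == 6, "sum_to(3) should be 6"
        assert namespace["sum_to"](0) == 0, "sum_to(0) should be 0"
    except Exception as exc:                          # any failure becomes feedback
        return f"{type(exc).__name__}: {exc}"
    return None


def stub_coder(feedback: Optional[str]) -> str:
    """Stand-in for a model: the first draft has an off-by-one, the revision fixes it."""
    if feedback is None:
        return "def sum_to(n):\n    return sum(range(n))"
    return "def sum_to(n):\n    return sum(range(n + 1))"


def edit_until_green(
    coder: Callable[[Optional[str]], str], max_rounds: int = 3
) -> Tuple[Optional[str], int]:
    """Revise until the tests pass, up to max_rounds; the test output is the critique."""
    feedback: Optional[str] = None
    for attempt in range(1, max_rounds + 1):
        source = coder(feedback)
        feedback = run_tests(source)
        if feedback is None:
            return source, attempt
    return None, max_rounds


first_failure = run_tests(stub_coder(None))
fixed_source, attempts = edit_until_green(stub_coder)
print(first_failure, attempts)
```

Each round here is anchored to a fresh test run rather than to the model's opinion of its own code, which is why longer loops remain productive in this setting.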

Trade-offs

Reflection enabled: higher output quality on most tasks; catches obvious mistakes; legible verdict logs. The costs are roughly 1.5–3x compute per output, added latency, a risk of over-correction, and the tuning needed to avoid revising what didn’t need it.

No reflection: fastest, cheapest, lowest latency. Accepts the producer’s first pass as-is. Reasonable for outputs that are low-stakes or already strongly verified by downstream consumers.

Other axes:

Reflection budget per task. Hard cap (1 round) vs adaptive (revise until a verifier passes, up to N). Adaptive is more expensive but catches harder errors; hard-capped is more predictable in cost.
Cheap critic always, expensive critic sometimes. A cascading critique pipeline — small fast model first, big slow model only on disputed cases — is usually the right shape. Don’t pay frontier-model prices on every output.
Verifier-first when possible. A unit test is usually cheaper and sharper than an LLM critique. Push as much reflection logic as possible into deterministic verifiers; reserve LLM critique for the semantic checks no verifier can do.
Visible-to-user vs invisible. Some products show “Reviewing…” spinners during reflection to set expectations; others hide it. For a coding agent, showing iterations is informative; for a chat agent, it’s confusing.