NVIDIA Eureka: LLM-Driven Reward Design

Context

Reinforcement-learning agents need reward functions: scalar signals that tell the policy whether it’s doing well or badly at each step. For dense robotic tasks — pen-spinning, drawer opening, surgical motions — designing those reward functions has been a famously hard human-expert craft. A good reward function can take weeks of iteration; a bad one teaches the agent the wrong behaviour.

Eureka, published by NVIDIA in late 2023, asked a different question: what if a coding LLM wrote the reward function instead? The result is an agent system that reads a task description, writes candidate reward functions as code, runs RL training against them, evaluates the resulting policies, and iterates — without a human in the inner loop. The system reportedly outperformed human-designed rewards on a substantial fraction of the Isaac Gym benchmark suite, including the now-famous pen-spinning task.

Problem

The concrete problem statement Eureka tackled:

Input. A task description in natural language (“make this robot hand spin a pen”) and access to the simulation environment’s source code (so the LLM can see what state variables exist).
Output. A reward function (Python code) that, when used to train an RL policy from scratch in the simulator, produces a policy that completes the task.
Constraint. No human-in-the-loop reward engineering. The LLM is the engineer.
Hard parts. Reward functions interact subtly with the policy: bad rewards lead to reward hacking, sparse gradients, or local optima that look like success but aren’t. The LLM can’t predict these failures by reading the code alone; it has to discover them empirically by running RL training.
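
To make the output contract concrete, here is a minimal sketch of what a candidate reward function could look like. The state fields (`pen_angular_velocity`, `pen_height`, `target_height`), the weights, and the plain-float types are illustrative assumptions, not code taken from Eureka or from Isaac Gym. Returning a per-term breakdown next to the total is the design choice that later makes per-component feedback possible.

```python
import math
from typing import Dict, Tuple


def compute_reward(state: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Hypothetical candidate reward for a pen-spinning-like task."""
    # Reward spin speed, saturating so that very fast spins gain little extra.
    spin = math.tanh(abs(state["pen_angular_velocity"]) / 5.0)
    # Bonus that decays as the pen drifts away from a target height.
    height_error = abs(state["pen_height"] - state["target_height"])
    height = math.exp(-10.0 * height_error)
    # Weighted terms are kept separate so they can be inspected after training.
    components = {"spin": 1.0 * spin, "height": 0.5 * height}
    return sum(components.values()), components


total, parts = compute_reward(
    {"pen_angular_velocity": 4.0, "pen_height": 0.52, "target_height": 0.50}
)
```

A real simulator would pass batched tensors rather than a dict of floats; the shape of the contract (state in, scalar plus named terms out) is the point of the sketch.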

This is an unusually clean agent design problem because the evaluation (does the RL policy succeed?) is automated. Most agent systems struggle with eval; Eureka inherits a nearly perfect eval signal from the RL setup.

Architecture

The Eureka agent is a closed loop over four stages:

   ┌─────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐
   │ Task        │    │ Reward     │    │ RL         │    │ Eval +     │
   │ Description │───►│ Generation │───►│ Training   │───►│ Reflection │
   │ + Env code  │    │ (LLM)      │    │ (simulator)│    │ (LLM)      │
   └─────────────┘    └─────▲──────┘    └────────────┘    └──┬──────┬──┘
                            │   candidate critique +         │      │
                            └── improvement suggestions ─────┘      │
                                                                    ▼
                                                              ┌────────────┐
                                                              │ Best       │
                                                              │ reward     │
                                                              │ + policy   │
                                                              └────────────┘

Four stages, in detail

Reward generation. The coding LLM (GPT-4-class) sees the task description, the simulator’s source code, and (in later iterations) the previous round’s best candidate plus a critique. It writes N candidate reward functions as Python code — typically N = 16 — in a single batched generation. Each candidate is a self-contained function that scores the environment state.
RL training. Each candidate reward function trains its own RL policy from scratch, and the candidates run in parallel (Isaac Gym makes this tractable by running thousands of simulated environments per GPU). Training is bounded: every candidate gets the same fixed step budget.
Evaluation. Each trained policy is rolled out and scored against a human-defined success metric (separate from the reward function the LLM wrote — this is the ground truth). The best candidate is selected.
Reward reflection. The LLM sees the per-component statistics of each reward term during training — which terms dominated, which were ignored, where the policy plateaued — and writes a critique. The critique seeds the next round’s reward-generation prompt.

The loop runs for several generations (typically 5), with the best reward function from each generation seeding the next.
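
The four stages can be sketched as a short search loop. This is a toy skeleton under stated assumptions, not Eureka’s implementation: `generate`, `train_and_score`, and `reflect` are hypothetical callables standing in for the LLM call, the RL training run plus ground-truth evaluation, and the reflection step. The toy stand-ins at the bottom replace "reward code" with a single number so the loop runs without a simulator.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: a real system would call an LLM, launch RL
# training in the simulator, and roll out the trained policy.
Generate = Callable[[str, Optional[str], int], List[str]]
TrainAndScore = Callable[[str], Tuple[float, str]]  # -> (success, stats text)
Reflect = Callable[[str, str], str]


def eureka_style_search(task: str, generate: Generate,
                        train_and_score: TrainAndScore, reflect: Reflect,
                        n_candidates: int = 16,
                        n_generations: int = 5) -> Tuple[str, float]:
    best_code, best_score = "", float("-inf")
    feedback: Optional[str] = None
    for _ in range(n_generations):
        # Stage 1: a batch of candidates, conditioned on the last critique.
        candidates = generate(task, feedback, n_candidates)
        # Stages 2-3: train each candidate, score it on the held-out metric.
        results = [(code, *train_and_score(code)) for code in candidates]
        gen_code, gen_score, gen_stats = max(results, key=lambda r: r[1])
        if gen_score > best_score:
            best_code, best_score = gen_code, gen_score
        # Stage 4: this generation's best candidate seeds the next prompt.
        feedback = reflect(gen_code, gen_stats)
    return best_code, best_score


# Toy stand-ins: the "reward code" is one weight, ideal value 0.7.
rng = random.Random(0)

def toy_generate(task: str, feedback: Optional[str], n: int) -> List[str]:
    centre = float(feedback) if feedback else 0.5
    return [f"{min(1.0, max(0.0, rng.gauss(centre, 0.2))):.3f}" for _ in range(n)]

def toy_train_and_score(code: str) -> Tuple[float, str]:
    weight = float(code)
    return 1.0 - abs(weight - 0.7), f"weight={weight}"

def toy_reflect(code: str, stats: str) -> str:
    return code  # a real reflection step would return a written critique

best, score = eureka_style_search(
    "toy task", toy_generate, toy_train_and_score, toy_reflect
)
```

In this sketch the candidates are scored one after another; the parallelism described above would live inside `train_and_score` or in how it is dispatched.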

Key innovations

Three things make Eureka work beyond the obvious “LLM writes code” framing:

Evolutionary search over the LLM’s output. Generating one candidate and iterating on it serially is what most agent papers do; Eureka generates a population of candidates per generation, trains all of them in parallel, and selects the survivors. The diversity matters: the search landscape over reward functions is non-convex, and a single chain of iterations often gets stuck. The population makes the search robust.
Reward reflection as feedback, not just outcome. The naive reflection signal is “did the agent succeed?” Eureka instead feeds back per-component reward statistics during training — how much each reward term contributed to the policy’s behaviour over time. This is much richer feedback than success/failure. The LLM can see that “the velocity term dominated and the position term was ignored” and adjust the next candidate.
Code as the reward surface. By writing rewards as Python — not as a learned model or as a fixed-form parameter vector — the agent gets the full expressiveness of programs: conditionals, decomposition, references to environment state variables by name. Humans designing rewards write code; Eureka gives the LLM the same surface.
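
The reflection input described above can be sketched as a small formatter that turns per-checkpoint reward-term values into text for the next prompt. The function name, the 80% and 5% thresholds, and the output layout are assumptions made for illustration; the source only says that per-component statistics are fed back.

```python
from statistics import mean
from typing import Dict, List


def summarise_components(history: Dict[str, List[float]],
                         success: List[float]) -> str:
    """Turn per-checkpoint reward-term values into text feedback for an LLM."""
    # Average magnitude of each term across the logged checkpoints.
    means = {name: mean(abs(v) for v in values)
             for name, values in history.items()}
    total = sum(means.values()) or 1.0
    lines = []
    for name, values in history.items():
        share = means[name] / total
        note = ""
        if share > 0.8:        # assumed threshold for "dominates"
            note = " (dominates the total)"
        elif share < 0.05:     # assumed threshold for "ignored"
            note = " (nearly ignored)"
        trace = ", ".join(f"{v:.2f}" for v in values)
        lines.append(f"{name}: [{trace}] share={share:.0%}{note}")
    lines.append("success: [" + ", ".join(f"{s:.2f}" for s in success) + "]")
    return "\n".join(lines)


report = summarise_components(
    {"velocity": [0.5, 2.0, 4.0, 4.1], "position": [0.05, 0.05, 0.04, 0.05]},
    success=[0.0, 0.1, 0.1, 0.1],
)
print(report)
```

With these sample numbers the velocity term is flagged as dominating and the position term as nearly ignored, which is the kind of signal an LLM can act on when rewriting the reward.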

A fourth, more subtle innovation: the system doesn’t need fine-tuning. A general-purpose coding LLM, used zero-shot for reward generation and reflection, is enough. The agent design — not a custom model — is what produces capability.

Evaluation

Eureka was evaluated on 29 tasks across Isaac Gym (dexterous-manipulation, locomotion, navigation). The headline results from the paper:

On 83% of the 29 tasks, Eureka’s reward functions outperformed human-expert-designed rewards (measured by the task’s success metric after the same RL training budget), with a reported average normalized improvement of 52%.
On several tasks where human-designed rewards had stagnated, Eureka discovered reward shapes that produced higher-performing policies.
The pen-spinning task, where a simulated Shadow Hand learned to spin a pen continuously, was widely cited as a result that no prior published reward function had achieved. The paper obtained it by combining Eureka’s rewards with curriculum learning.

The evaluation harness is worth studying separately — most agent systems can’t be evaluated this cleanly. Eureka’s was: a fixed-budget RL training run, a fixed human-defined success metric per task, and per-task reproducibility numbers (the system was run multiple times to measure variance).
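
The variance measurement can be sketched as a small harness that repeats the whole pipeline under different seeds and reports the spread of the fixed success metric. The `run` callable and the toy stand-in below are hypothetical; a real `run` would execute reward search plus RL training and return the final success rate.

```python
import random
from statistics import mean, stdev
from typing import Callable, Dict, Sequence


def evaluate_across_seeds(run: Callable[[int], float],
                          seeds: Sequence[int]) -> Dict[str, float]:
    """Run the pipeline once per seed and summarise the fixed metric."""
    scores = [run(seed) for seed in seeds]
    return {
        "mean": mean(scores),
        # Sample standard deviation needs at least two runs.
        "std": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def toy_run(seed: int) -> float:
    # Stand-in for a full training run: a noisy success rate in [0, 1].
    rng = random.Random(seed)
    return min(1.0, max(0.0, 0.7 + rng.uniform(-0.1, 0.1)))


summary = evaluate_across_seeds(toy_run, seeds=[0, 1, 2, 3, 4])
```

Keeping the metric and the budget fixed across seeds is what makes the resulting mean and spread comparable between methods.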

Trade-offs and limitations

What Eureka does well. Dense-reward tasks with observable state variables. Tasks where the simulator’s source code is available (the LLM uses it). Domains where parallel RL is cheap (Isaac Gym, Unity ML-Agents).

Where it struggles. Sparse-reward tasks (the RL signal is too weak for the inner loop to converge in budget). Tasks where the source code is too large to fit in the LLM’s context. Real-robot loops where each trial is expensive — Eureka assumes cheap simulator rollouts.

Other limitations:

Compute-heavy. Each generation runs N candidate RL trainings in parallel. The reported configurations used dozens of GPU-hours per task. That is a normal cost for RL, but it is not free.
Reward hacking is still possible. A clever reward function can produce a policy that maximises the reward without solving the task. Eureka mitigates this by evaluating against a separate ground-truth success metric — but in domains without such a metric, the safety net disappears.
Closed-loop dependence on the LLM. If the LLM hallucinates an environment variable that doesn’t exist, the reward function crashes. The system handles this with retries and validation, which adds latency.
Single-task focus. Eureka generates rewards for one task at a time. Generalising across a family of tasks means rerunning the loop for each task; there is no shared cross-task agent.
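
The retry-and-validation step for hallucinated variables can be sketched as a dry run: load the candidate source, call it once on a sample state, and capture any error as text that can be fed back to the LLM. The function name `load_and_check`, the expected entry point `compute_reward`, and the sample state are assumptions for illustration. Note that `exec` on LLM-written code should only ever run inside a sandbox.

```python
import traceback
from typing import Callable, Dict, Optional, Tuple


def load_and_check(source: str, sample_state: Dict[str, float]
                   ) -> Tuple[Optional[Callable], Optional[str]]:
    """Return (function, None) if the candidate runs, else (None, error)."""
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)              # defines the candidate function
        fn = namespace["compute_reward"]     # assumed entry-point name
        total, components = fn(dict(sample_state))  # dry run on a copy
        float(total)                         # the total must be a number
    except Exception:
        # Keep only the final traceback line, e.g. "KeyError: 'pen_spin_rate'".
        return None, traceback.format_exc().strip().splitlines()[-1]
    return fn, None


GOOD = '''
def compute_reward(state):
    spin = abs(state["pen_angular_velocity"])
    return spin, {"spin": spin}
'''
# Same code, but referring to a state field that does not exist.
BAD = GOOD.replace("pen_angular_velocity", "pen_spin_rate")

sample = {"pen_angular_velocity": 1.0}
good_fn, good_err = load_and_check(GOOD, sample)
bad_fn, bad_err = load_and_check(BAD, sample)
```

Catching the failure before any RL training starts is cheap compared with discovering it after a GPU job has been launched, and the error string gives the next generation prompt something specific to fix.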

Lessons

Things to take away from Eureka for any agent design:

If your task has a built-in verifier, exploit it. Most agent systems struggle with eval; Eureka inherits one for free from the RL setup. Designing your problem so it has a programmatic success signal is half the agent work.
Population search beats serial iteration when the search landscape is non-convex. Diversity of candidates lets you escape local optima. The cost is N× the compute per generation; the win is robustness.
Reflection is much more useful when the feedback is rich. “It failed” is a weak signal. “Reward term X dominated; term Y was ignored; the policy plateaued at 80% of the step budget” is a useful signal. Engineer the reflection input.
LLMs are good engineers for narrow surfaces. Code is the right surface for reward functions because that’s how humans write them. Pick the right output type for your agent — natural language, JSON, code, structured queries — based on what the downstream consumer needs.
Evaluate the agent’s output against a metric the agent doesn’t see. Eureka’s separation of “reward function (LLM-designed)” and “success metric (human-designed)” is the safeguard against reward hacking. The same principle generalises: don’t let the agent grade its own homework.
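
The last lesson can be sketched in a few lines: select on the held-out success metric only, and treat a candidate whose own reward climbed while success stayed near zero as a possible case of reward hacking. The `Candidate` fields, the sample numbers, and the `min_success` threshold are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    reward_start: float   # LLM-written reward at the start of training
    reward_end: float     # LLM-written reward at the end of training
    success_rate: float   # held-out, human-defined metric


def select_best(candidates: List[Candidate]) -> Candidate:
    # Selection never looks at the LLM-written reward, only the held-out metric.
    return max(candidates, key=lambda c: c.success_rate)


def flag_possible_hacking(candidates: List[Candidate],
                          min_success: float = 0.1) -> List[str]:
    # Reward went up during training, yet the task is still not being solved.
    return [c.name for c in candidates
            if c.reward_end > c.reward_start and c.success_rate < min_success]


pool = [
    Candidate("A", reward_start=0.1, reward_end=5.0, success_rate=0.80),
    Candidate("B", reward_start=0.2, reward_end=40.0, success_rate=0.02),
    Candidate("C", reward_start=1.0, reward_end=0.9, success_rate=0.00),
]
winner = select_best(pool)             # candidate "A"
suspects = flag_possible_hacking(pool)  # ["B"]
```

Reward values from different candidates are not on a common scale, so the check compares each candidate only against its own starting point.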

Why pen-spinning, specifically?

Pen-spinning is a continuous-contact dexterous-manipulation task where the reward function design space is genuinely hard. Earlier published work hadn’t succeeded at training a Shadow Hand to perform sustained pen-spinning; the reward shapes that work are non-obvious. Eureka’s success there was the headline-grabbing moment — not because pen-spinning matters intrinsically, but because it was a task where the human benchmark was clearly low and LLM-discovered rewards crossed it. The headline reads as “LLMs designing better rewards than humans” but the real story is “evolutionary search in a code-space, with rich reflection signal, beats hand-iteration when the design space is non-convex.”