Building a Eureka-Style Reward Loop with ADK

End-to-end implementation: reward generation, evaluation, selection, reflection, and human feedback — wired together as a closed loop.

Implementation Advanced
14 min read
implementation adk reward-design evolutionary-search

Goal#

This writeup walks through implementing a Eureka-style reward-design loop with the Agent Development Kit. The case study — NVIDIA Eureka — LLM-Driven Reward Design — explains what the system does and why it works. This writeup builds it.

You’ll end with a runnable Python project that: takes a task description and an environment spec, generates N candidate reward functions per generation, runs short RL training against each, evaluates the policies against a separate success metric, selects the best, reflects on the candidate’s reward components, and seeds the next generation. The implementation here is small enough to run on a laptop with a toy environment and large enough to scale to real simulators by swapping the inner trainer.

By the end you should understand: how to structure a multi-generation agent loop, how to keep the LLM’s output disciplined enough to compile, how to wire the reflection signal so it actually informs the next round, and where the human-in-the-loop knobs live.

Prerequisites#

Before starting:

  • NVIDIA Eureka — LLM-Driven Reward Design — the case study; the architecture description is referenced throughout.
  • Setting Up and Grounding an Agent — the scaffold below assumes you’ve done a hello-world ADK project.
  • A Python RL stack. This writeup uses Gymnasium with a tiny custom environment so the loop is reproducible on a laptop. In a real setup you’d swap to Isaac Gym, Brax, or your own simulator.
  • An LLM API. Reward generation needs a model that’s strong at code (Gemini 2.5 Pro, GPT-4-class, Claude Sonnet/Opus). Don’t use a tiny model for this step: bad reward functions waste a whole generation’s worth of training compute.

Step-by-step#

1. Plan the loop#

The control flow we’re building, end to end:

for generation g in 1..G:
    1. Build prompt with: task description, env spec, last-gen best reward + reflection
    2. LLM generates N candidate reward functions (parallel sampling)
    3. For each candidate:
        a. Compile and validate
        b. Train RL policy for fixed step budget
        c. Evaluate policy against task-success metric
        d. Log per-component reward statistics
    4. Select best candidate by success metric
    5. LLM reflects on best candidate's training stats → critique
    6. Persist generation results; seed next gen

The generation count G is small, typically 5 to 8. RL training dominates the cost of each generation; the LLM calls are cheap by comparison.
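To make the cost split concrete, here is a rough budget calculation. It is a sketch only: the function names are hypothetical, and the numbers plugged in are the defaults used later in this writeup (5 generations, 8 candidates, 50,000 training steps per candidate).

```python
def total_training_steps(generations: int, candidates_per_gen: int, steps_per_candidate: int) -> int:
    """Environment steps spent on RL training across the whole run."""
    return generations * candidates_per_gen * steps_per_candidate


def total_llm_calls(generations: int, candidates_per_gen: int) -> int:
    """N reward-generation calls plus one reflection call per generation."""
    return generations * (candidates_per_gen + 1)


if __name__ == "__main__":
    print(total_training_steps(5, 8, 50_000))  # 2000000
    print(total_llm_calls(5, 8))               # 45
```

Two million environment steps against 45 model calls is why the step budget, not the prompt, is usually the first thing to tune.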

2. Define the environment spec#

The LLM needs to know what state variables exist. In a real setup you’d parse the simulator’s source code; for our toy environment we expose a structured spec.

env/cartpole_balance.py
import gymnasium as gym
import numpy as np
class BalanceEnv(gym.Env):
    """Toy: balance a pole on a cart; minimise pole angle and cart drift."""
    # State variable spec the LLM gets to read.
    STATE_SPEC = {
        "cart_position": "float in [-2.4, 2.4]; episode ends if outside",
        "cart_velocity": "float; horizontal velocity",
        "pole_angle": "float in radians; vertical is 0.0",
        "pole_angular_velocity": "float; positive = falling right",
    }
    # The task-success metric, kept separate from the reward; the LLM doesn't see this.
    TASK_SUCCESS = "fraction of episode steps where abs(pole_angle) < 0.1"
    # ... standard Gymnasium env body ...

STATE_SPEC is what the LLM consumes. TASK_SUCCESS is the held-out evaluator: the reward-generating LLM never sees it, so it can’t write rewards that game the metric directly.
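TASK_SUCCESS above is only a description string. A minimal sketch of the computation it describes follows; the function name and the choice to return 0.0 for an empty episode are assumptions, not part of the original env.

```python
def task_success(pole_angles: list[float], threshold: float = 0.1) -> float:
    """Fraction of recorded steps where abs(pole_angle) < threshold."""
    if not pole_angles:
        return 0.0  # assumption: an empty episode counts as no success
    within = sum(1 for angle in pole_angles if abs(angle) < threshold)
    return within / len(pole_angles)


if __name__ == "__main__":
    # Two of the four steps are inside the +/-0.1 rad band.
    print(task_success([0.0, 0.05, 0.2, -0.3]))  # 0.5
```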

3. The reward-generation prompt#

The prompt is the most important piece. It has to: explain the task, list state variables the LLM may use, describe the API the reward must conform to, and (in later generations) include the prior best candidate plus a critique.

prompts/reward_gen.py
REWARD_GEN_TEMPLATE = """You are designing a reward function for an RL agent.
TASK: {task}
The environment exposes these state variables (names you may reference):
{state_spec}
Write a Python function with this exact signature:
    def reward(state: dict[str, float], action: float, info: dict) -> tuple[float, dict[str, float]]:
        ...
        return total_reward, component_breakdown
`component_breakdown` is a dict from reward-component name to its
contribution that step. You must include at least 2 named components
and at most 5; the total is their weighted sum.
Constraints:
- Pure function. No globals, no imports beyond `math`.
- All branches must produce a finite float.
- Do not reference state variables outside the list above.
{prior_section}
Return the function as a single Python code block. No prose."""
PRIOR_SECTION_TEMPLATE = """The previous generation's best candidate is below.
Its reward components had these per-step statistics over training:
{stats}
The reflection on this candidate was:
{critique}
Improve on it. The new candidate must be different in structure
(not a parameter tweak) and address the reflection."""

A few prompt-engineering notes:

  • Signature is fixed. The function shape is locked. The LLM only fills the body.
  • Component breakdown is mandatory. This is what the reflection step reads. Without it, reflection has nothing to grip on.
  • “Pure function, no globals” is enforced by the validator. Saying it in the prompt biases the LLM toward compliant code; the validator catches the rest.
  • Prior section is conditional. Generation 1 has no prior; generation 2+ does.
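The output contract stated in the prompt (a `(total, components)` tuple, two to five named components, finite values) can also be checked at runtime on the first few calls of a candidate. The helper below is a hypothetical sketch, not part of the original validator.

```python
import math


def check_reward_output(result, min_components: int = 2, max_components: int = 5) -> tuple[bool, str]:
    """Check one return value of a candidate reward function against the prompt's contract."""
    if not (isinstance(result, tuple) and len(result) == 2):
        return False, "expected a (total, components) tuple"
    total, components = result
    if not isinstance(components, dict):
        return False, "components must be a dict"
    if not min_components <= len(components) <= max_components:
        return False, f"expected {min_components}-{max_components} components, got {len(components)}"
    for name, value in [("total", total), *components.items()]:
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            return False, f"{name} is not a finite number"
    return True, ""


if __name__ == "__main__":
    print(check_reward_output((1.0, {"angle": 0.7, "drift": 0.3})))
    print(check_reward_output((float("nan"), {"angle": 0.7, "drift": 0.3})))
```

Calling it on a handful of sampled states before training starts is cheap compared with discovering a NaN reward thousands of steps into a run.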

4. The reward sampling step#

Some model APIs can return several samples from a single request, but for this loop it’s cleaner to drive N independent agent invocations in parallel, so each candidate comes from its own call and can be extracted, validated, and logged on its own.

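reward_gen.py
import asyncio
import re
from google.adk.agents import LlmAgent
from google.adk.runners import InMemoryRunner
from google.genai import types
GEN_AGENT = LlmAgent(
    name="reward_generator",  # ADK agent names must be valid Python identifiers
    model="gemini-2.5-pro",
    instruction="Generate a reward function. Output Python only.",
)
async def run_agent_once(agent: LlmAgent, prompt: str) -> str:
    """Run one single-turn call in a fresh session and return the final text.
    Assumes the Runner/session API of recent ADK releases; adjust to your version."""
    runner = InMemoryRunner(agent=agent, app_name="eureka")
    session = await runner.session_service.create_session(app_name="eureka", user_id="loop")
    message = types.Content(role="user", parts=[types.Part(text=prompt)])
    text = ""
    async for event in runner.run_async(user_id="loop", session_id=session.id, new_message=message):
        if event.is_final_response() and event.content and event.content.parts:
            text = "".join(part.text or "" for part in event.content.parts)
    return text
async def generate_candidates(prompt: str, n: int) -> list[str]:
    """Generate n candidate reward functions in parallel."""
    async def one() -> str:
        return _extract_code_block(await run_agent_once(GEN_AGENT, prompt))
    # return_exceptions=True: one malformed response shouldn't sink the whole generation.
    results = await asyncio.gather(*(one() for _ in range(n)), return_exceptions=True)
    return [r for r in results if isinstance(r, str)]
def _extract_code_block(text: str) -> str:
    """Pull the python code block out of the LLM response."""
    m = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    if not m:
        raise ValueError(f"no code block in response: {text[:200]}")
    return m.group(1).strip()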
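Firing all N requests at once can run into provider rate limits. One way to cap the number of in-flight calls is an `asyncio.Semaphore`. The sketch below is provider-agnostic: `sample_once` is a hypothetical stand-in for whatever coroutine makes a single model call, and `max_in_flight=4` is an arbitrary default.

```python
import asyncio
from typing import Awaitable, Callable


async def sample_bounded(
    sample_once: Callable[[], Awaitable[str]],
    n: int,
    max_in_flight: int = 4,
) -> list[str]:
    """Run `sample_once` n times with at most `max_in_flight` calls running concurrently."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded() -> str:
        async with semaphore:
            return await sample_once()

    return list(await asyncio.gather(*(guarded() for _ in range(n))))


if __name__ == "__main__":
    async def fake_llm_call() -> str:
        await asyncio.sleep(0.01)  # stands in for network latency
        return "def reward(state, action, info): ..."

    print(len(asyncio.run(sample_bounded(fake_llm_call, n=8, max_in_flight=2))))
```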

5. Compile and validate each candidate#

Running LLM-generated code is the part where you need a tight sandbox. For this implementation we use a restricted exec namespace and a static check.

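validate.py
import ast
import builtins
import math
# Builtins a candidate may use; dict/str/tuple are needed to evaluate the signature's annotations.
ALLOWED_BUILTINS = {"abs", "min", "max", "sum", "float", "int", "bool", "dict", "str", "tuple"}
def validate_reward_source(source: str) -> tuple[bool, str]:
    """Static checks before we run the candidate."""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return False, f"syntax: {e}"
    # Only a top-level def ends up in the exec namespace, so check tree.body, not nested defs.
    if not any(isinstance(n, ast.FunctionDef) and n.name == "reward" for n in tree.body):
        return False, "missing top-level function named 'reward'"
    for node in ast.walk(tree):
        # Reject imports beyond math
        if isinstance(node, ast.Import) and any(a.name != "math" for a in node.names):
            return False, "disallowed import"
        if isinstance(node, ast.ImportFrom) and node.module != "math":
            return False, "disallowed from-import"
        # Enforce the prompt's "no globals" constraint
        if isinstance(node, (ast.Global, ast.Nonlocal)):
            return False, "global/nonlocal not allowed"
    return True, ""
def _import_math_only(name, *args, **kwargs):
    """Stand-in for __import__ so `import math` still works under restricted builtins."""
    if name != "math":
        raise ImportError(f"import of {name!r} is not allowed")
    return math
def compile_reward(source: str):
    """Compile and return the reward function or raise."""
    ok, msg = validate_reward_source(source)
    if not ok:
        raise ValueError(f"invalid reward source: {msg}")
    # Restricted namespace: only the allow-listed builtins plus math are reachable by name.
    safe_builtins = {name: getattr(builtins, name) for name in ALLOWED_BUILTINS}
    safe_builtins["__import__"] = _import_math_only
    namespace = {"__builtins__": safe_builtins, "math": math}
    exec(source, namespace)
    fn = namespace.get("reward")
    if not callable(fn):
        raise ValueError("no callable named 'reward' after exec")
    return fn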

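This is intentionally narrow, and it is not a real sandbox: an in-process exec is acceptable only for a toy run on code you are prepared to read. Production systems should run candidates in a separate process or a sandboxed runtime (gVisor, Firecracker, a constrained Docker container). The principle to hold to: candidate code never runs in the same address space as the orchestrator.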
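As a first step toward process isolation, a candidate can be called in a child Python interpreter with a timeout. The sketch below is an illustration under assumptions: `call_reward_isolated` and the JSON-over-stdin protocol are invented for this example, and a plain subprocess is not a security boundary by itself. It protects the orchestrator from crashes and infinite loops, not from hostile code.

```python
import json
import subprocess
import sys

# Child script: reads {"source", "state", "action"} as JSON on stdin, runs the
# candidate, and writes {"total", "components"} as JSON on stdout.
_CHILD = """
import json, math, sys
req = json.load(sys.stdin)
ns = {"math": math}
exec(req["source"], ns)
total, components = ns["reward"](req["state"], req["action"], {})
json.dump({"total": total, "components": components}, sys.stdout)
"""


def call_reward_isolated(source: str, state: dict, action: float, timeout_s: float = 5.0):
    """Evaluate one reward call in a separate interpreter; raise on crash or timeout."""
    payload = json.dumps({"source": source, "state": state, "action": action})
    proc = subprocess.run(
        [sys.executable, "-I", "-c", _CHILD],  # -I: isolated mode, ignores env vars and user site
        input=payload,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired if the child hangs
    )
    if proc.returncode != 0:
        raise RuntimeError(f"candidate crashed: {proc.stderr.strip()[-200:]}")
    out = json.loads(proc.stdout)
    return out["total"], out["components"]


if __name__ == "__main__":
    src = (
        "def reward(state, action, info):\n"
        "    a = -abs(state['pole_angle'])\n"
        "    return a, {'angle': a, 'alive': 0.0}\n"
    )
    print(call_reward_isolated(src, {"pole_angle": 0.1}, 0.0))
```

Spawning an interpreter per reward call is far too slow for training. In practice the whole training job for one candidate would run inside the child, with only the score and the component stats crossing the process boundary.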

6. Train RL against each candidate#

Wrap the env so the reward comes from the candidate function. Train with a short budget — the goal is to rank candidates, not produce a final policy.

train.py
import gymnasium as gym
from stable_baselines3 import PPO
from validate import compile_reward
def wrap_env_with_reward(base_env, reward_fn, component_log=None):
    """Wrap a Gymnasium env so its reward comes from `reward_fn`.
    If `component_log` is given, per-step component values are appended to it."""
    class _Wrap(gym.Wrapper):
        def step(self, action):
            obs, _, terminated, truncated, info = self.env.step(action)
            state = self._obs_to_state(obs)
            r, components = reward_fn(state, float(action), info)
            info["reward_components"] = components
            if component_log is not None:
                for k, v in components.items():
                    component_log.setdefault(k, []).append(float(v))
            return obs, float(r), terminated, truncated, info
        def _obs_to_state(self, obs):
            return {
                "cart_position": float(obs[0]),
                "cart_velocity": float(obs[1]),
                "pole_angle": float(obs[2]),
                "pole_angular_velocity": float(obs[3]),
            }
    return _Wrap(base_env)
def train_and_eval(reward_source: str, total_steps: int = 50_000):
    """Train a policy with this reward; return (success_score, component_stats)."""
    reward_fn = compile_reward(reward_source)
    component_log: dict[str, list[float]] = {}
    # CartPole-v1 stands in for BalanceEnv here; its observation layout matches STATE_SPEC.
    env = wrap_env_with_reward(gym.make("CartPole-v1"), reward_fn, component_log)
    model = PPO("MlpPolicy", env, verbose=0)
    # No callback: a callable that returns None is treated as "stop training" by SB3.
    model.learn(total_timesteps=total_steps)
    # Summarise first so evaluation steps don't leak into the training statistics.
    # _summarise and _eval_policy are helpers not shown in this listing.
    stats = _summarise(component_log)
    # Evaluate against the held-out task-success metric
    success = _eval_policy(model, env)
    return success, stats

The component log accumulates per-step stats across training. After training, _summarise reduces them to a structured dict the reflection prompt can read.

7. Reflect on the best candidate#

The reflection agent reads the best candidate’s source plus its per-component stats and writes a structured critique.

# reflect.py: the reflection prompt uses a numbered structure for consistent critiques.
# The template embeds the candidate source, per-component stats over training, and the
# task-success score, then asks four numbered questions.
from google.adk.agents import LlmAgent
from google.adk.runners import InMemoryRunner
from google.genai import types
REFLECT_TEMPLATE = (
    "You are reviewing a reward function used in RL training.\n"
    "Candidate source:\n<<SOURCE>>\n"
    "Per-component statistics over training (mean, min, max, std):\n<<STATS>>\n"
    "Task-success score: <<SUCCESS>>\n"
    "Write a structured critique addressing these four questions:\n"
    "1. Which reward components dominated? Were they aligned with the task?\n"
    "2. Which components were ignored (near-zero mean)? Reweight, replace, or remove?\n"
    "3. Any obvious failure mode: saturation, sign flip, division by near-zero?\n"
    "4. What is the single most important change for the next candidate?\n"
    "Reference component names. Keep the critique under 200 words."
)
REFLECT_AGENT = LlmAgent(
    name="reward_reflector",  # ADK agent names must be valid Python identifiers
    model="gemini-2.5-pro",
    instruction="Analyze reward functions; produce structured critiques.",
)
async def reflect(source: str, stats: dict, success: float) -> str:
    prompt = (REFLECT_TEMPLATE
              .replace("<<SOURCE>>", source)
              .replace("<<STATS>>", repr(stats))
              .replace("<<SUCCESS>>", f"{success:.3f}"))
    # Assumes the Runner/session API of recent ADK releases; adjust to your version.
    runner = InMemoryRunner(agent=REFLECT_AGENT, app_name="eureka")
    session = await runner.session_service.create_session(app_name="eureka", user_id="loop")
    message = types.Content(role="user", parts=[types.Part(text=prompt)])
    critique = ""
    async for event in runner.run_async(user_id="loop", session_id=session.id, new_message=message):
        if event.is_final_response() and event.content and event.content.parts:
            critique = "".join(part.text or "" for part in event.content.parts)
    return critique.strip()

Note: the reflection prompt is structured (numbered questions) rather than open-ended (“what do you think?”). The numbered structure produces consistently usable critiques across runs.
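The `reflect` function passes the stats as a raw `repr`. A compact one-line-per-component rendering may be easier for the model to read, and it can pre-flag the "ignored component" case that question 2 asks about. The helper below is a hypothetical sketch; the `ignore_below` threshold of 1e-3 is an arbitrary assumption, and the input is assumed to carry mean, min, max, and std per component.

```python
def format_stats(stats: dict[str, dict[str, float]], ignore_below: float = 1e-3) -> str:
    """Render per-component stats one per line, flagging near-zero means."""
    lines = []
    for name in sorted(stats):
        s = stats[name]
        flag = "  [near-zero mean]" if abs(s["mean"]) < ignore_below else ""
        lines.append(
            f"{name}: mean={s['mean']:.4f} min={s['min']:.4f} "
            f"max={s['max']:.4f} std={s['std']:.4f}{flag}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_stats({
        "angle": {"mean": -0.5, "min": -1.0, "max": 0.0, "std": 0.25},
        "drift": {"mean": 0.0, "min": 0.0, "max": 0.0, "std": 0.0},
    }))
```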

8. The outer loop#

loop.py
import json
from pathlib import Path
from prompts.reward_gen import PRIOR_SECTION_TEMPLATE, REWARD_GEN_TEMPLATE
from reflect import reflect
from reward_gen import generate_candidates
from train import train_and_eval
async def run_eureka_loop(
    task: str,
    state_spec: dict,
    generations: int = 5,
    candidates_per_gen: int = 8,
    out_dir: Path = Path("runs"),
):
    out_dir.mkdir(parents=True, exist_ok=True)
    prior_critique = None
    prior_best_source = None
    history = []
    for g in range(generations):
        print(f"\n=== generation {g + 1}/{generations} ===")
        # Build the prompt
        prior_section = ""
        if prior_critique:
            prior_section = PRIOR_SECTION_TEMPLATE.format(
                stats=history[-1]["stats"],
                critique=prior_critique,
            )
            # The template promises the previous best candidate; append its source.
            prior_section += f"\nPrevious best candidate source:\n{prior_best_source}"
        prompt = REWARD_GEN_TEMPLATE.format(
            task=task,
            state_spec=_format_state_spec(state_spec),
            prior_section=prior_section,
        )
        # Generate N candidates
        candidates = await generate_candidates(prompt, candidates_per_gen)
        # Train + eval each; a candidate that fails to compile or train is skipped
        scored = []
        for i, src in enumerate(candidates):
            try:
                success, stats = train_and_eval(src)
                scored.append({"src": src, "success": success, "stats": stats})
                print(f" candidate {i}: success={success:.3f}")
            except Exception as e:
                print(f" candidate {i}: FAIL ({e})")
        if not scored:
            print("no candidates trained successfully; aborting")
            break
        # Pick best
        best = max(scored, key=lambda c: c["success"])
        print(f" best success: {best['success']:.3f}")
        # Reflect
        critique = await reflect(best["src"], best["stats"], best["success"])
        print(f" critique: {critique[:200]}...")
        # Persist before starting the next generation
        gen_record = {
            "generation": g,
            "best_success": best["success"],
            "best_source": best["src"],
            "stats": best["stats"],
            "critique": critique,
        }
        (out_dir / f"gen-{g}.json").write_text(json.dumps(gen_record, indent=2))
        history.append(gen_record)
        prior_best_source = best["src"]
        prior_critique = critique
    return history
def _format_state_spec(spec: dict) -> str:
    return "\n".join(f"- {k}: {v}" for k, v in spec.items())

9. Human feedback hook#

The system as described has no human in the loop. To add one, gate the next-generation prompt on a human review of the critique:

async def reflect_with_human_review(source, stats, success):
    critique = await reflect(source, stats, success)
    print("\n--- LLM critique ---")
    print(critique)
    print("\nEdit the critique below (blank line to accept, or paste a replacement):")
    edit = _read_paragraph_from_stdin()
    return edit or critique

In production you’d swap stdin for a queue or web UI, but the principle is the same — humans correct or augment the critique, which then seeds the next reward generation. This is the highest-leverage place to put the human, far more so than reviewing each candidate.

10. Run#

Terminal window
uv run python -c "
import asyncio
from env.cartpole_balance import BalanceEnv
from loop import run_eureka_loop
asyncio.run(run_eureka_loop(
task='Balance a pole on a cart; keep cart near origin',
state_spec=BalanceEnv.STATE_SPEC,
generations=5,
candidates_per_gen=8,
))
"

On a CartPole-shaped task this runs in minutes on a laptop. On a real simulator (Isaac Gym, Brax, MuJoCo) you’d scale total_steps up and run the candidate trainings in parallel; the loop’s structure doesn’t change.
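Running the candidate trainings in parallel can be sketched with `concurrent.futures`. The helper below is hypothetical: `train_all` takes any `Executor` and any training function with the `(success, stats)` return shape used by `train_and_eval`. For CPU-bound RL training a `ProcessPoolExecutor` is the likelier choice; the demo uses threads only to stay lightweight.

```python
from concurrent.futures import Executor, ThreadPoolExecutor, as_completed
from typing import Callable


def train_all(
    sources: list[str],
    train_fn: Callable[[str], tuple[float, dict]],
    executor: Executor,
) -> list[dict]:
    """Train every candidate on the executor; drop failures, keep original order."""
    futures = {executor.submit(train_fn, src): i for i, src in enumerate(sources)}
    scored = []
    for future in as_completed(futures):
        i = futures[future]
        try:
            success, stats = future.result()
        except Exception as e:  # a failed candidate is skipped, as in the sequential loop
            print(f" candidate {i}: FAIL ({e})")
            continue
        scored.append({"index": i, "src": sources[i], "success": success, "stats": stats})
    return sorted(scored, key=lambda c: c["index"])


if __name__ == "__main__":
    def fake_train(src: str) -> tuple[float, dict]:
        if src == "bad":
            raise ValueError("did not compile")
        return float(len(src)), {}

    with ThreadPoolExecutor(max_workers=2) as pool:
        results = train_all(["aa", "bad", "bbbb"], fake_train, pool)
    print([c["index"] for c in results])  # [0, 2]
```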

Code structure#

eureka-adk/
├── pyproject.toml
├── .env
├── env/
│ └── cartpole_balance.py (toy env with STATE_SPEC + TASK_SUCCESS)
├── prompts/
│ ├── reward_gen.py (REWARD_GEN_TEMPLATE)
│ └── reflect.py (REFLECT_TEMPLATE)
├── reward_gen.py (generate_candidates: N parallel LLM calls)
├── validate.py (AST checks + compile_reward)
├── train.py (wrap_env_with_reward + train_and_eval)
├── reflect.py (reflection agent)
├── loop.py (run_eureka_loop: the outer driver)
└── runs/ (per-generation JSON records, gitignored)

Three principles in the layout:

  • The LLM-touching code is concentrated. reward_gen.py and reflect.py are the only files that talk to a model. Everything else — training, validation, environment — is plain Python.
  • The prompts are in their own files. Easy to iterate, diff, and share.
  • Each generation’s record is a JSON file. Reproducible runs need persisted intermediate state. Don’t lose the trajectory; you’ll want to inspect it.
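Persisted records also make a crashed run resumable. A minimal sketch, assuming the `gen-{g}.json` naming used by the outer loop; `load_history` is a hypothetical helper, and stopping at the first missing file is a design choice made for this example.

```python
import json
from pathlib import Path


def load_history(out_dir: Path) -> list[dict]:
    """Load gen-0.json, gen-1.json, ... in order, stopping at the first missing file."""
    history = []
    g = 0
    while (out_dir / f"gen-{g}.json").exists():
        history.append(json.loads((out_dir / f"gen-{g}.json").read_text()))
        g += 1
    return history
```

A resumed run would start at generation `len(history)` and seed the prompt from `history[-1]["best_source"]` and `history[-1]["critique"]`.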

Loop control and exit conditions#

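The outer loop has three natural exit conditions: the generation budget runs out, the best candidate clears a success threshold, or the best score plateaus. The first is the for loop itself; the other two look like this: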

# Stopping rules to add to run_eureka_loop
EARLY_STOP_SUCCESS = 0.95  # if best >= this, we're done
PLATEAU_PATIENCE = 2  # gens of no improvement to stop
best_seen = -float("inf")
gens_without_improvement = 0
for g in range(generations):
    # ... train, eval, pick best ...
    if best["success"] >= EARLY_STOP_SUCCESS:
        print("early stop: success threshold met")
        break
    if best["success"] > best_seen + 1e-3:
        best_seen = best["success"]
        gens_without_improvement = 0
    else:
        gens_without_improvement += 1
        if gens_without_improvement >= PLATEAU_PATIENCE:
            print("early stop: plateau")
            break

A fourth knob, and the more important one, is the per-candidate training step budget. Spend too little and the policy never reveals the reward’s shape; spend too much and each generation takes hours. Tune by running one generation, looking at the per-component stats, and asking “could a longer run change which reward is best?” If the answer is no, your training budget is already enough.
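The "could a longer run change which reward is best?" question can be checked empirically by training the same candidates at two budgets and comparing which ones come out on top. The helper below is a hypothetical sketch; comparing only the top-k set is one of several reasonable stability criteria.

```python
def ranking_stable(short_scores: list[float], long_scores: list[float], top_k: int = 1) -> bool:
    """True if the same candidates make the top_k under both training budgets.

    Both lists hold success scores for the same candidates, in the same order.
    """
    if len(short_scores) != len(long_scores):
        raise ValueError("score lists must cover the same candidates")

    def top(scores: list[float]) -> set[int]:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:top_k])

    return top(short_scores) == top(long_scores)


if __name__ == "__main__":
    # Candidate 1 wins under both budgets, so the short budget was enough to rank.
    print(ranking_stable([0.2, 0.6, 0.4], [0.3, 0.9, 0.5]))  # True
    # The winner changes from candidate 1 to candidate 2: the short budget misleads.
    print(ranking_stable([0.2, 0.6, 0.4], [0.3, 0.5, 0.9]))  # False
```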

Common pitfalls#

The pitfalls that hit Eureka-style implementations:

  • Unrestricted code execution. LLM-generated code goes through a sandbox or at least a static check. Skipping this is how you end up with os.system("rm -rf ~") in your training loop.
  • Single-candidate-per-generation laziness. It’s tempting to start with N=1 and “see if it works.” It doesn’t — the LLM gets stuck in a local optimum. Start with N>=4 even on toy tasks; the diversity is what makes the loop work.
  • Reflection on the wrong signal. Telling the reflection agent “success was 0.42” without per-component stats produces vague critiques (“try harder”). Pipe the structured component breakdown; the quality of reflection scales directly with the quality of feedback.
  • Re-running training on a stale env. A candidate that uses a state variable from generation 1 might silently break if generation 3 modifies the env. Pin the env definition for the whole run.
  • No persistence. If the run crashes at generation 4, you lose three generations of work. Persist each gen’s record before starting the next.
  • Reward hacking via the task-success metric. If your task-success metric is observable to the LLM (e.g., it’s in the env spec it reads), the LLM will design rewards that game it. Keep the metric out of the reward generator’s prompt.
  • Tiny model for reward generation. A weak model produces syntactically correct nonsense rewards. The reward generator is one of the two places (along with the reflection step) where model quality matters most.

Why generate N candidates in parallel rather than serial-iterate one?

The reward-function landscape is non-convex: a small change to a reward can produce a totally different trained policy. Iterating one candidate serially (“read, critique, edit, re-train”) tends to stay in the neighbourhood of the first design and stall in a local optimum. Sampling N diverse candidates per generation explores the space breadth-first; the selection step keeps the best survivor; the reflection step seeds the next breadth-first sweep. The cost is N× train compute per generation; the benefit is robustness to non-convexity. This is the same reason evolutionary algorithms can outperform single-trajectory local search on jagged loss surfaces.
