Design a Self-Improving Web Agent

How to evolve a WebVoyager-style agent over time: experience replay, failure mining, prompt-update loops, eval harness.

Scenario#

You’ve inherited a WebVoyager-style multimodal web agent that’s been running in production for six months. It does what it says on the tin: a user types a goal in natural language (“book me a one-way flight from SFO to NRT on the 14th, cheapest non-stop economy”), the agent drives a headless browser — Playwright under the hood — taking screenshots, calling vision-language reasoning on each page, clicking buttons, filling forms, submitting. It works well enough that the team kept it; it works poorly enough that escalations to human operators are weekly and CSAT is mediocre.

Your mandate from product: make this thing better over time, without us having to retrain the base model every quarter.

What that means concretely:

  • The agent must learn from its own runs. Failure cases today should not be failure cases next month.
  • The improvement loop must be safe — a bad prompt update shouldn’t regress the success rate overnight.
  • The improvement loop must be cheap. Retraining or fine-tuning the base vision-language model is off the table for now; budgets allow for prompt updates, retrieval-corpus updates, and small-scale reward-model training only.
  • It must be measurable. “It feels better” doesn’t ship. Pass-rate on a frozen eval set has to move.

Design the closed-loop system. Architecture, data plumbing, the update mechanism, the eval harness, the safety gates.

Constraints#

What’s fixed:

  • The base model is hosted and you cannot fine-tune it. Updates happen via prompts, retrieved examples, tool-surface changes, or auxiliary models — not weight updates.
  • Production traffic is the only meaningful source of failure data. Synthetic benchmarks are useful for regression but don’t capture the long tail.
  • No site-specific scrapers. The team had a bad experience with brittle per-site DOM selectors. Every action must go through the generic perceive-screenshot → reason → click/type loop.
  • Latency budget per step is 4 seconds end-to-end. A 10-step task budget gives you ~40 seconds; users tolerate up to 90 before churning.
  • Cost ceiling is $0.40 per successful task. Failed tasks may cost more (retries), but the amortised cost has to land there.
  • Privacy. User goals can mention PII (email addresses, credit card numbers, passport info during a booking). The improvement loop must learn from trajectories without persisting raw PII.
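
The cost ceiling is an amortised figure, so it helps to write the arithmetic down. The sketch below assumes "amortised" means total spend across all attempts (failures and retries included) divided by the number of successful tasks. The function names and the weekly numbers are illustrative, not from the production system.

```python
def amortised_cost_per_success(total_spend_usd: float, successes: int) -> float:
    """Total spend across all attempts (failures and retries included) per success."""
    if successes <= 0:
        raise ValueError("no successful tasks to amortise over")
    return total_spend_usd / successes


def max_mean_spend_per_task(ceiling_usd: float, pass_rate: float) -> float:
    """Largest mean spend per attempted task that still meets the ceiling.

    amortised = (mean_spend * n) / (pass_rate * n) = mean_spend / pass_rate,
    so mean_spend must stay at or below ceiling * pass_rate.
    """
    return ceiling_usd * pass_rate


# Hypothetical week: 1,000 attempted tasks, 780 successes, $330 total spend.
weekly = amortised_cost_per_success(330.0, 780)
print(f"amortised cost per success: ${weekly:.3f}")
print(f"mean spend allowed per task: ${max_mean_spend_per_task(0.40, 0.78):.3f}")
```

Under these assumed numbers the week lands at roughly $0.42 per success, which is over the ceiling even though the mean spend per attempted task is only $0.33. Raising pass-rate loosens the per-attempt budget; adding retries tightens it.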

What’s variable:

  • The retrieval corpus shape — how many examples, indexed by what, scored by what.
  • The prompt-update mechanism — manual review, LLM-as-judge, reflection-then-rewrite, evolutionary search.
  • The auxiliary models — reward model, failure classifier, screenshot grounding model — can be small and fine-tuneable.
  • The eval harness — composition, weights, frequency.

What’s wishful (don’t assume):

  • That the agent’s own self-critique is reliable. It is not. Reflection helps but is noisy.
  • That site UIs are stable. They are not. A working trajectory from yesterday may not generalise to today’s redesign.
  • That every failure is the agent’s fault. Many failures are environmental — the site rate-limited the IP, the captcha changed, the flight literally sold out mid-booking.

Approach#

A reasonable architecture, in broad strokes:

          ┌──────────────────────────────────────────────┐
          │               Production runs                │
          │  (WebVoyager-style ReAct loop, Playwright)   │
          └───────────────────────┬──────────────────────┘
                                  │ trajectory + outcome
                                  ▼
                     ┌─────────────────────────┐
                     │    Trajectory store     │ (PII-scrubbed)
                     │ (screenshots + thoughts │
                     │   + actions + result)   │
                     └────────────┬────────────┘
        ┌─────────────────────────┼─────────────────────────┐
        ▼                         ▼                         ▼
┌────────────────┐    ┌────────────────────────┐  ┌────────────────────┐
│ Failure mining │    │ Successful exemplar    │  │ Reward model       │
│ (cluster &     │    │ index (RAG corpus)     │  │ (small, fine-tuned │
│ classify fails)│    │                        │  │ on outcomes)       │
└───────┬────────┘    └───────────┬────────────┘  └─────────┬──────────┘
        │                         │                         │
        └─────────────────────────┼─────────────────────────┘
                                  ▼
                       ┌─────────────────────┐
                       │   Update proposer   │ (suggests prompt /
                       │ (LLM + reflection)  │  tool / retrieval edits)
                       └──────────┬──────────┘
                                  ▼
                       ┌─────────────────────┐
                       │    Eval harness     │ ← frozen + held-out
                       │  (shadow + canary)  │
                       └──────────┬──────────┘
                                  ▼
                       ┌──────────────────────┐
                       │    Promotion gate    │ (human review for big
                       │                      │  changes; auto for small)
                       └──────────┬───────────┘
                                  ▼
                       ┌──────────────────────┐
                       │  Production config   │
                       │ (versioned, rollback)│
                       └──────────────────────┘

The closed loop has five stages — observe, mine, propose, evaluate, promote.

  • Observe. Every production run writes a trajectory: ordered list of (screenshot_hash, dom_summary, thought, action, observation) tuples plus the final outcome (success / failure / abandoned / human-handoff). PII is redacted at write time, never persisted raw.

  • Mine. A scheduled job clusters failed trajectories by failure signature — e.g. “got stuck on cookie banner”, “couldn’t find the date picker”, “selected wrong dropdown option”. A small classifier (or LLM-as-judge with a structured rubric) labels each cluster.

  • Propose. For the top failure clusters, an “update proposer” agent generates candidate fixes: a prompt patch (“when you see a cookie banner with accept and reject buttons, prefer…”), a new retrieved exemplar to add to the RAG corpus, or a new helper tool (e.g. dismiss_modal_overlays). Proposals are bounded — one cluster, one fix, one config diff.

  • Evaluate. Each proposal runs against a frozen eval set of 500 tasks plus a held-out set of 200 recent production replays. Pass-rate, average steps-to-success, and average cost are tracked. A proposal that improves the targeted cluster but regresses the frozen set fails the gate.

  • Promote. Small changes (prompt edits, exemplar additions) auto-promote if they pass the gate. Large changes (new tools, structural prompt changes) require human review. Every promotion is versioned; rollback is a single config flip.
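
The Observe stage can be sketched as a trajectory record whose text fields are redacted before anything is stored. The field names follow the tuple described above. The regexes, the placeholder tokens, and the `make_step` helper are assumptions for illustration; a production system would need a vetted PII detector rather than three patterns.

```python
import hashlib
import re
from dataclasses import dataclass, field

# Hypothetical patterns: illustrative only, not an exhaustive PII detector.
_PII_PATTERNS = [
    ("<EMAIL>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<CARD>", re.compile(r"\b\d(?:[ -]?\d){12,18}\b")),      # 13 to 19 digits
    ("<PASSPORT>", re.compile(r"\b[A-Z]{1,2}\d{6,8}\b")),
]


def redact(text: str) -> str:
    """Replace PII-shaped substrings with structural tokens."""
    for token, pattern in _PII_PATTERNS:
        text = pattern.sub(token, text)
    return text


@dataclass(frozen=True)
class Step:
    screenshot_hash: str
    dom_summary: str
    thought: str
    action: str
    observation: str


@dataclass
class Trajectory:
    goal: str
    steps: list[Step] = field(default_factory=list)
    outcome: str = "abandoned"  # success / failure / abandoned / human-handoff


def make_step(screenshot: bytes, dom_summary: str, thought: str,
              action: str, observation: str) -> Step:
    """Build a step with every text field redacted at write time.

    Only a hash of the screenshot is kept here; stored screenshot pixels can
    also contain PII and would need their own redaction pass.
    """
    return Step(
        screenshot_hash=hashlib.sha256(screenshot).hexdigest(),
        dom_summary=redact(dom_summary),
        thought=redact(thought),
        action=redact(action),
        observation=redact(observation),
    )


step = make_step(
    b"fake-png-bytes",
    "checkout form with email and card fields",
    "I will enter jane.doe@example.com in the email field",
    'type("card_number", "4111 1111 1111 1111")',
    "form accepted",
)
trajectory = Trajectory(goal=redact("book a flight, receipt to jane.doe@example.com"))
trajectory.steps.append(step)
print(step.thought)
print(step.action)
```

Redacting inside the constructor path, rather than in a later batch job, means no code path can persist the raw strings by accident.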

Critically, the agent itself is not the proposer. Letting the agent rewrite its own prompts at runtime is a fast track to catastrophic drift. The proposer is a separate process, with its own eval gates, that updates the agent’s config offline.
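
One way to make "the proposer updates the agent's config offline" concrete is an immutable config object plus an append-only registry. This is a minimal sketch under assumptions: `AgentConfig`, `ConfigRegistry`, and the exemplar-index identifier are hypothetical names, and a real system would back the registry with durable storage.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    """Everything the runtime agent reads; immutable once built."""
    version: int
    system_prompt: str
    exemplar_index_id: str   # points at an immutable snapshot of the RAG corpus
    tools: tuple[str, ...]   # tuple, not list, so the tool set cannot be mutated


class ConfigRegistry:
    """Append-only history of configs plus a pointer to the active one."""

    def __init__(self, initial: AgentConfig) -> None:
        self._history = [initial]
        self._active = 0

    @property
    def active(self) -> AgentConfig:
        return self._history[self._active]

    def promote(self, candidate: AgentConfig) -> None:
        """Called by the offline promotion gate, never by the runtime agent."""
        expected = self._history[-1].version + 1
        if candidate.version != expected:
            raise ValueError(f"expected version {expected}, got {candidate.version}")
        self._history.append(candidate)
        self._active = len(self._history) - 1

    def rollback(self) -> AgentConfig:
        """Move the active pointer back one version; history is kept."""
        if self._active == 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active -= 1
        return self.active


v1 = AgentConfig(1, "base prompt", "exemplars-snapshot-001", ("click", "type"))
registry = ConfigRegistry(v1)
registry.promote(replace(v1, version=2, system_prompt="base prompt + cookie-banner hint"))
print(registry.active.version)

registry.rollback()
print(registry.active.version)

try:
    registry.active.system_prompt = "edited at runtime"
except FrozenInstanceError:
    print("runtime edits are rejected")
```

Because the runtime only ever reads `registry.active`, every production call can be tied to exactly one config version, which is what makes replay and rollback possible.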

Design decisions to make#

  1. What’s the trajectory representation? Storing every screenshot is expensive. Storing only thoughts and actions loses grounding context. The pragmatic compromise: store thoughts + actions + DOM-summary text in full, and store screenshots only at decision points (page-transition boundaries) plus the final state.

  2. How is failure detected? Three layers, increasingly costly: (a) the agent’s own self-reported outcome at the end of the run — cheap and unreliable; (b) a downstream system signal — “did the booking confirmation email arrive?” — reliable but slow; (c) an LLM-as-judge replay over the trajectory — moderately reliable, moderately costly. Run (a) live, run (b) when available, run (c) on a sampled basis.

  3. What does the RAG corpus retrieve? Options: full successful trajectories (largest, most signal, expensive at inference), trajectory summaries with key decisions, or just “lessons” extracted from successes (“on this site, the search button is at the top-right”). Lean toward summaries indexed by task type + site domain.

  4. Reward model — is it worth it? A small reward model trained on (trajectory, outcome) pairs can give a faster, cheaper signal than full LLM-as-judge. Useful for ranking proposals and for runtime “should I continue or abort?” decisions. Cost: another model to maintain, and it can mislearn just like any classifier.

  5. How much agent self-reflection at runtime? Reflection-after-failure (let the agent try once, critique itself, retry) helps but roughly doubles the cost of reflected runs. Cap it at one reflection per task, so at most two passes in total. Do not loop reflection: returns diminish past the second pass while cost keeps growing.

  6. Eval-set composition. Three buckets:

    • Frozen regression set. 500 hand-curated tasks that must keep passing. Locked. Never edited except by the eval owner.
    • Recent-production replay set. 200 tasks rotated weekly from real traffic. Catches drift from site changes.
    • Adversarial set. 100 deliberately hard cases — pop-ups, captchas, multi-step bookings, mid-flow errors. Floor here matters as much as ceiling.

  7. Promotion gates. Auto-promote thresholds: targeted-cluster pass-rate up by at least 3 absolute points, frozen-set pass-rate within ±0.5 points of baseline, average cost within ±5% of baseline. Anything outside these bands routes to human review.

  8. Loop control. The agent has a hard step budget (15 steps, which at 4 seconds per step is 60 seconds, inside the 90 seconds users tolerate), a stagnation detector (3 consecutive thoughts touching the same DOM region without progress = abort), and an explicit escape hatch (the agent can call request_human_handoff() rather than continue).
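
The loop-control rule in the last item can be sketched as a small controller that the ReAct loop consults after every step. How a "DOM region" is identified and how "progress" is judged are assumptions here; the sketch takes both as inputs. `LoopController` is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class LoopController:
    max_steps: int = 15          # hard step budget
    stagnation_window: int = 3   # consecutive no-progress steps on one region
    steps_taken: int = 0
    _recent: list[str] = field(default_factory=list)

    def record(self, dom_region: str, made_progress: bool) -> str:
        """Record one step and return 'continue', 'abort_stagnation' or 'abort_budget'.

        The caller is assumed to check for task success before calling this.
        """
        self.steps_taken += 1
        if made_progress:
            self._recent.clear()
        else:
            self._recent.append(dom_region)
            self._recent = self._recent[-self.stagnation_window:]

        stuck = (
            len(self._recent) == self.stagnation_window
            and len(set(self._recent)) == 1
        )
        if stuck:
            return "abort_stagnation"
        if self.steps_taken >= self.max_steps:
            return "abort_budget"
        return "continue"


controller = LoopController()
for _ in range(3):
    verdict = controller.record("#date-picker", made_progress=False)
print(verdict)
```

On either abort verdict the runtime would stop acting and could fall back to request_human_handoff(), so a stuck run ends in a handoff rather than a silent timeout.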

Trade-offs to discuss#

  • Reflection at runtime. Better recovery on flaky pages. The agent tries, fails, critiques, retries with new insight. Costs: 2x token spend on reflected runs, harder to attribute success to base prompt vs reflection prompt, can mask underlying weaknesses (“we don’t need to fix the base prompt; reflection saves us”).

  • Reflection deferred to offline. Cheaper at runtime. Reflection happens in the proposer over failed trajectories, distilled into prompt or exemplar updates. Slower feedback loop, but the improvements compound and don’t bloat per-run cost.

  • Auto-promotion of small changes. Faster iteration, fewer humans in the loop, more updates land per week. Risk: a class of bug only visible in distributional shift over hours of traffic, which the eval harness misses, lands automatically and regresses CSAT before anyone notices.

  • Human review on every change. Safer, slower. Engineer time is the bottleneck. Most updates are pedestrian and humans rubber-stamp; the gate becomes performative. Hybrid (auto for trivially small, human for structural) is usually correct.
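
The hybrid policy can be sketched as a routing function that uses the auto-promote bands from the promotion-gates decision above (+3 points on the targeted cluster, ±0.5 points on the frozen set, ±5% on cost). The change-kind labels and the `EvalDelta` type are hypothetical; deltas are candidate minus baseline.

```python
from dataclasses import dataclass

# Assumed labels for the "small" changes that are eligible for auto-promotion.
SMALL_CHANGE_KINDS = frozenset({"prompt_edit", "exemplar_addition"})


@dataclass(frozen=True)
class EvalDelta:
    cluster_pass_rate_pts: float  # targeted cluster, absolute percentage points
    frozen_pass_rate_pts: float   # frozen regression set, absolute percentage points
    cost_change_pct: float        # average cost, relative percent


def route(kind: str, delta: EvalDelta) -> str:
    """Return 'auto_promote' or 'human_review' for one proposal."""
    if kind not in SMALL_CHANGE_KINDS:
        # New tools and structural prompt changes always get a human.
        return "human_review"
    within_bands = (
        delta.cluster_pass_rate_pts >= 3.0
        and abs(delta.frozen_pass_rate_pts) <= 0.5
        and abs(delta.cost_change_pct) <= 5.0
    )
    return "auto_promote" if within_bands else "human_review"


print(route("prompt_edit", EvalDelta(4.2, -0.1, 1.5)))
print(route("prompt_edit", EvalDelta(4.2, -1.3, 1.5)))
print(route("new_tool", EvalDelta(9.0, 0.0, 0.0)))
```

Note that the frozen-set band is symmetric: an unexpectedly large improvement on the frozen set also goes to a human, since it is as likely to signal eval leakage as a real gain.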

Other axes to surface:

  • Failure mining clustering: by failure signature (what action failed), by task type (what was the goal), by site (where it failed), or jointly? Joint is more informative and harder to act on; per-axis is leakier but easier to drive fixes from. In practice, do all three and let the proposer pick.

  • Exemplar staleness: the RAG corpus is a memory; memories go stale when sites redesign. Need a TTL or recency-weighted retrieval, and a “this exemplar is now misleading” pruning path triggered when retrievals correlate with current failures.

  • Cost vs accuracy frontier: Pareto-optimise. A change that lifts pass-rate from 78% to 80% but doubles cost may not be a win; one that lifts 78% to 79% at flat cost may be. Always plot pass-rate against cost-per-success, never one without the other.

  • Privacy in trajectories: a redaction pass at trajectory-write time replaces credit-card-shaped strings, passport-shaped strings, emails, etc. with structural tokens. The exemplar index never holds raw user data. Audit this — a single leaked card number is a much worse incident than a missed booking.

  • Reward hacking: if the reward model says “success = a confirmation page was reached”, the loop will favour prompts and exemplars that steer the agent to confirmation pages without completing the actual purchase. Guard against this with the downstream signal (did the email arrive, did the charge post) and adversarial eval cases.

  • The “improvement” that’s actually regression to the easy distribution: an update can lift overall pass-rate by trading hard-case pass-rate for easy-case pass-rate. The eval harness needs stratification — pass-rate broken out by difficulty bucket — to catch this.
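
The regression-to-easy failure in the last bullet is easy to demonstrate with a stratified comparison. The sketch below assumes each eval result is a (difficulty_bucket, passed) pair; the bucket names and the result data are made up for illustration.

```python
from collections import defaultdict


def pass_rate_by_bucket(results):
    """Map each bucket to its pass-rate, given (bucket, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [passes, attempts]
    for bucket, passed in results:
        totals[bucket][0] += int(passed)
        totals[bucket][1] += 1
    return {bucket: p / n for bucket, (p, n) in totals.items()}


def overall_pass_rate(results):
    return sum(int(passed) for _, passed in results) / len(results)


def regressed_buckets(baseline, candidate, tolerance_pts=0.0):
    """Buckets whose pass-rate dropped by more than tolerance_pts points."""
    base = pass_rate_by_bucket(baseline)
    cand = pass_rate_by_bucket(candidate)
    return sorted(
        bucket for bucket in base
        if (base[bucket] - cand.get(bucket, 0.0)) * 100 > tolerance_pts
    )


# Made-up results: the candidate gains on easy tasks and loses on hard ones.
baseline = ([("easy", True)] * 8 + [("easy", False)] * 2
            + [("hard", True)] * 5 + [("hard", False)] * 5)
candidate = ([("easy", True)] * 10
             + [("hard", True)] * 4 + [("hard", False)] * 6)

print(overall_pass_rate(baseline), overall_pass_rate(candidate))
print(regressed_buckets(baseline, candidate))
```

In this made-up data the overall pass-rate rises from 65% to 70% while the hard bucket falls from 50% to 40%, so a gate that looks only at the aggregate would promote a change that makes hard cases worse.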

Evaluation criteria#

A passing answer:

  • Has a closed-loop architecture — observe / mine / propose / evaluate / promote, with each stage’s data flow drawn.
  • Names a concrete trajectory representation — what fields, what cardinality, where stored, with PII redaction.
  • Has an eval harness with frozen + production-replay + adversarial buckets and stratifies pass-rate.
  • Has a promotion gate with explicit thresholds and a rollback story.
  • Acknowledges that the proposer is offline and the runtime agent is immutable per deploy.
  • Names at least one failure mode of the loop itself (reward hacking, exemplar staleness, regression-to-easy, drift).

A strong answer adds:

  • Cost-stratified eval — pass-rate vs cost-per-success as the optimisation target, not pass-rate alone.
  • Subgroup eval — pass-rate by task type, by site, by difficulty, with a “must not regress past X%” floor per subgroup.
  • A failure-mining cadence — how often the mining job runs, how clusters are surfaced, who acts on them.
  • An incident response plan — if a promotion regresses CSAT or pass-rate post-deploy, what’s the detection latency and the rollback path.
  • A “what we’ll never do” list — runtime self-editing prompts, unlogged proposer changes, eval-set leakage into the RAG corpus. The candidate who articulates these is the one who’s run a real continual-learning system.
  • Quantitative goal — “lift pass-rate from 78% to 84% in two quarters at flat cost”, not “make it better”.

The single design decision interviewers keep returning to

The boundary between runtime and offline. Strong candidates draw this boundary explicitly and defend it. Weak candidates blur it: “the agent reads from a corpus that updates itself, and the agent also writes to the corpus, and there’s a reflection loop, and…”. That design has no replay story, no rollback story, and no eval gate — every production call could in principle be operating against a different effective system. The right answer is: the agent at runtime is a frozen artifact. The corpus, the prompts, the tool set — all versioned, all immutable within a deploy. Learning happens at the boundary between deploys, offline, gated. Internalise this and you’ll pass the interview.
