Building a Multimodal Web Agent with ADK

Playwright state management, screenshot-and-DOM grounding tools, and a multimodal ReAct loop assembled with ADK primitives.

Implementation Advanced
12 min read
implementation adk web-agent multimodal playwright

Goal#

This writeup implements a WebVoyager-style multimodal web agent with ADK and Playwright. The case study — WebVoyager — Multimodal Web Agent — explains the architecture: a vision-language model in a ReAct loop, Set-of-Mark grounding via numbered overlays on screenshots, a small discrete action space, and an observation pipeline that survives lazy-loading and dynamic content.

You’ll end with a runnable agent that: launches a browser, navigates to a URL, takes a screenshot, overlays numbered marks on interactive elements, presents the annotated screenshot plus a text mark-list to a vision-language model, executes the model’s chosen action, and repeats until done. The implementation is faithful to the architecture but small enough to read in one sitting.

Prerequisites#

Before starting:

  • WebVoyager — Multimodal Web Agent — the case study for context.
  • Setting Up and Grounding an Agent — the scaffold below assumes you’ve done a hello-world ADK project.
  • A vision-language model. Gemini 2.5 Pro, GPT-4o, or Claude 3.5 Sonnet — all support image input. Text-only models won’t work.
  • Playwright. uv add playwright && uv run playwright install chromium.
  • Pillow for image manipulation. uv add pillow.

Step-by-step#

1. The action space#

Start by pinning the action shapes. The agent’s output schema is the contract that holds the loop together.

actions.py
from pydantic import BaseModel
from typing import Literal, Union


class Click(BaseModel):
    action: Literal["click"]
    mark_id: int
    reasoning: str


class Type(BaseModel):
    action: Literal["type"]
    mark_id: int
    text: str
    submit: bool = False
    reasoning: str


class Scroll(BaseModel):
    action: Literal["scroll"]
    direction: Literal["up", "down", "page_up", "page_down"]
    reasoning: str


class Wait(BaseModel):
    action: Literal["wait"]
    reasoning: str


class GoBack(BaseModel):
    action: Literal["go_back"]
    reasoning: str


class Answer(BaseModel):
    action: Literal["answer"]
    text: str
    reasoning: str


AgentAction = Union[Click, Type, Scroll, Wait, GoBack, Answer]

Six action types. Each is a small pydantic model with a literal discriminator, the parameters, and a reasoning field the model fills in. The literal action field is what we’ll use to dispatch in the executor.

The reasoning field serves two purposes: it forces the model to articulate its step, which improves quality, and it gives you a debuggable trace.
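To make the role of the literal discriminator concrete, here is a minimal stdlib-only sketch of routing a raw dict to the right action class by its action field. Dataclasses stand in for the pydantic models, only two of the six actions are shown, and `ACTION_TYPES` and `parse_action` are hypothetical names that are not part of the code above.

```python
from dataclasses import dataclass


@dataclass
class Click:
    mark_id: int
    reasoning: str


@dataclass
class Scroll:
    direction: str
    reasoning: str


# Hypothetical registry: discriminator value -> action class
ACTION_TYPES = {"click": Click, "scroll": Scroll}


def parse_action(raw: dict):
    """Pick the class named by raw['action'] and build it from the remaining keys."""
    fields = dict(raw)  # copy so the caller's dict is not mutated
    kind = fields.pop("action", None)
    if kind not in ACTION_TYPES:
        raise ValueError(f"unknown action: {kind!r}")
    return ACTION_TYPES[kind](**fields)


action = parse_action(
    {"action": "click", "mark_id": 7, "reasoning": "open the search box"}
)
```

Pydantic performs the same routing, plus field validation, when it parses the union; the sketch only shows why each model needs a distinct literal value.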

2. The browser controller#

A thin wrapper around Playwright that handles navigation, waiting, and tear-down.

browser.py
from playwright.async_api import (
    Browser,
    Page,
    TimeoutError as PlaywrightTimeoutError,
    async_playwright,
)


class BrowserController:
    def __init__(self):
        self._pw = None
        self._browser: Browser | None = None
        self.page: Page | None = None

    async def start(self, headless: bool = True):
        self._pw = await async_playwright().start()
        self._browser = await self._pw.chromium.launch(headless=headless)
        ctx = await self._browser.new_context(
            viewport={"width": 1280, "height": 800},
            user_agent="Mozilla/5.0 ... AgentSDK/1.0",
        )
        self.page = await ctx.new_page()

    async def goto(self, url: str):
        await self.page.goto(url, wait_until="domcontentloaded")
        # Bounded wait: a page that never goes idle must not abort navigation
        await self._settle(timeout_ms=10_000)

    async def click_at(self, x: int, y: int):
        await self.page.mouse.click(x, y)
        await self._settle()

    async def type_at(self, x: int, y: int, text: str, submit: bool):
        # Click first to focus the field, then type into it
        await self.page.mouse.click(x, y)
        await self.page.keyboard.type(text)
        if submit:
            await self.page.keyboard.press("Enter")
        await self._settle()

    async def scroll(self, direction: str):
        delta = {"up": -300, "down": 300, "page_up": -800, "page_down": 800}[direction]
        await self.page.mouse.wheel(0, delta)
        await self._settle(short=True)

    async def go_back(self):
        await self.page.go_back()
        await self._settle()

    async def _settle(self, short: bool = False, timeout_ms: int | None = None):
        if timeout_ms is None:
            timeout_ms = 2000 if short else 5000
        try:
            await self.page.wait_for_load_state("networkidle", timeout=timeout_ms)
        except PlaywrightTimeoutError:
            pass  # some pages never go idle; that's fine

    async def stop(self):
        if self._browser:
            await self._browser.close()
        if self._pw:
            await self._pw.stop()

The _settle helper is the unglamorous load-bearing piece. After every action, wait for the page to quiet down before reading state again. The try/except around wait_for_load_state is because some pages (especially with long-polling or analytics beacons) never go fully idle; we cap the wait and move on.
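The cap-and-move-on pattern can be tried without a browser. The following is a minimal sketch using only asyncio; `settle`, `_quick_page`, and `_never_idle` are hypothetical stand-ins for the Playwright wait and for two kinds of page, not part of the controller above.

```python
import asyncio


async def settle(wait_for_idle, timeout_s: float) -> bool:
    """Await wait_for_idle() for at most timeout_s seconds.

    Returns True if the wait finished, False if the cap was hit.
    """
    try:
        await asyncio.wait_for(wait_for_idle(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False  # the page never went idle; carry on regardless


async def _quick_page():
    await asyncio.sleep(0.01)  # goes idle almost immediately


async def _never_idle():
    await asyncio.sleep(3600)  # stands in for a long-polling page


async def demo():
    return await settle(_quick_page, 0.5), await settle(_never_idle, 0.05)


results = asyncio.run(demo())
```

Either way the caller proceeds to the next observation; the return value is only useful for logging how often the cap is hit.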

3. Extract interactive elements and draw marks#

This is the grounding step. After each action, take a screenshot, find the interactive elements, and overlay numbered boxes.

grounding.py
import io
from dataclasses import dataclass

from PIL import Image, ImageDraw


@dataclass
class Mark:
    mark_id: int
    role: str  # button, link, input, etc.
    name: str  # accessible name
    bbox: tuple[int, int, int, int]  # x, y, w, h


INTERACTIVE_SELECTOR = """
a, button, input:not([type=hidden]),
select, textarea, [role=button], [role=link],
[role=textbox], [role=combobox], [role=checkbox],
[role=radio], [tabindex]:not([tabindex='-1'])
"""


async def extract_marks(page) -> list[Mark]:
    """Find all interactive elements in the current viewport with usable bboxes."""
    # Run in-page JS to collect candidates with their bounding rects + role + name
    candidates = await page.evaluate(f"""
    () => {{
        const els = document.querySelectorAll(`{INTERACTIVE_SELECTOR}`);
        const out = [];
        for (const el of els) {{
            const r = el.getBoundingClientRect();
            if (r.width < 2 || r.height < 2) continue;
            if (r.bottom < 0 || r.top > window.innerHeight) continue;
            const style = window.getComputedStyle(el);
            if (style.visibility === 'hidden' || style.display === 'none') continue;
            out.push({{
                role: el.getAttribute('role') || el.tagName.toLowerCase(),
                name: (el.getAttribute('aria-label') || el.innerText || el.value || '').trim().slice(0, 60),
                x: Math.round(r.left), y: Math.round(r.top),
                w: Math.round(r.width), h: Math.round(r.height),
            }});
        }}
        return out;
    }}
    """)
    return [
        Mark(mark_id=i, role=c["role"], name=c["name"],
             bbox=(c["x"], c["y"], c["w"], c["h"]))
        for i, c in enumerate(candidates)
    ]


async def take_annotated_screenshot(page, marks: list[Mark]) -> bytes:
    """Take a viewport screenshot, draw numbered marks on each interactive element."""
    raw = await page.screenshot(type="png", full_page=False)
    img = Image.open(io.BytesIO(raw)).convert("RGB")
    draw = ImageDraw.Draw(img)
    for m in marks:
        x, y, w, h = m.bbox
        # Box around the element
        draw.rectangle([x, y, x + w, y + h], outline=(255, 64, 64), width=2)
        # Mark number, top-left, with a contrasting backdrop
        label = str(m.mark_id)
        tw, th = draw.textbbox((0, 0), label)[2:]
        # Clamp so labels for elements at the very top stay inside the image
        bx0, by0 = x, max(0, y - th - 2)
        draw.rectangle([bx0, by0, bx0 + tw + 4, by0 + th + 2], fill=(255, 64, 64))
        draw.text((bx0 + 2, by0), label, fill=(255, 255, 255))
    out = io.BytesIO()
    img.save(out, format="PNG")
    return out.getvalue()


def marks_to_text(marks: list[Mark]) -> str:
    lines = ["Marks visible in the screenshot:"]
    for m in marks:
        name = m.name or "(no label)"
        lines.append(f" {m.mark_id}: <{m.role}> {name}")
    return "\n".join(lines)

A few notes:

  • Filter by viewport. Off-screen elements aren’t useful to mark. We exclude anything outside the current viewport vertically.
  • Filter by visibility. display:none, visibility:hidden, and zero-size elements are skipped.
  • Truncate names to 60 chars. Long element names blow up the text mark-list. The model still sees the screenshot for full context.
  • The bbox is the source of truth for clicks. When the agent picks mark_id=7, we click the centre of mark 7’s bbox.
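The size and viewport filters, and the centre-of-bbox rule, are simple enough to check in plain Python. This is a sketch that mirrors the in-page JS filter; `in_viewport` and `center` are hypothetical helper names, and the 800-pixel height matches the viewport configured in the browser controller.

```python
def in_viewport(bbox, viewport_height: int) -> bool:
    """Keep boxes at least 2px in each dimension that overlap the viewport vertically."""
    x, y, w, h = bbox
    if w < 2 or h < 2:
        return False
    top, bottom = y, y + h
    return not (bottom < 0 or top > viewport_height)


def center(bbox) -> tuple[int, int]:
    """Click target for a mark: the centre of its bounding box."""
    x, y, w, h = bbox
    return x + w // 2, y + h // 2


boxes = [
    (100, 50, 80, 30),    # on screen
    (100, 2000, 80, 30),  # far below the fold
    (10, 10, 1, 1),       # too small to be a usable target
    (0, -40, 200, 30),    # scrolled off the top
]
visible = [b for b in boxes if in_viewport(b, viewport_height=800)]
click_point = center(visible[0])
```

Only the first box survives the filter, and its click point is the middle of the box rather than its top-left corner.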

4. Build the agent#

ADK supports multimodal input. The agent receives the task, the annotated screenshot, and the text mark-list each turn.

agent.py
from pathlib import Path
from google.adk.agents import LlmAgent
from google.adk.types import StructuredOutput
from actions import AgentAction

SYSTEM_PROMPT = """You are a web agent that completes user tasks by
interacting with a browser.
Each turn you see:
- The task you are completing.
- An annotated screenshot of the current viewport. Interactive
elements are outlined in red and numbered.
- A text list of those marks: id, role, accessible name.
- Your action history so far.
Output exactly one action, as structured JSON matching the schema.
Rules:
1. Refer to elements by mark_id. Do not output coordinates.
2. If the goal is reached, output `answer` with the final result.
3. If a page is loading or animating, output `wait`.
4. If you took a wrong action and need to recover, use `go_back`.
5. Always explain your choice in the `reasoning` field.
6. Prefer fewer, higher-confidence actions over many guesses."""


def build_web_agent() -> LlmAgent:
    return LlmAgent(
        name="web-voyager",
        model="gemini-2.5-pro",  # multimodal-capable
        instruction=SYSTEM_PROMPT,
        output_schema=StructuredOutput(AgentAction),
    )

The output_schema parameter tells ADK to enforce the action union — the model can’t emit free-form text that breaks the dispatch.

5. The executor — map action to browser call#

executor.py
import asyncio

from browser import BrowserController
from grounding import Mark
from actions import AgentAction, Click, Type, Scroll, Wait, GoBack, Answer


async def execute(
    action: AgentAction,
    browser: BrowserController,
    marks: list[Mark],
) -> str:
    """Execute the action; return a short text result for the next observation."""
    if isinstance(action, Click):
        m = _get_mark(marks, action.mark_id)
        cx, cy = _center(m)
        await browser.click_at(cx, cy)
        return f"clicked mark {action.mark_id} ({m.role}: {m.name!r})"
    if isinstance(action, Type):
        m = _get_mark(marks, action.mark_id)
        cx, cy = _center(m)
        await browser.type_at(cx, cy, action.text, submit=action.submit)
        return f"typed into mark {action.mark_id}: {action.text!r}"
    if isinstance(action, Scroll):
        await browser.scroll(action.direction)
        return f"scrolled {action.direction}"
    if isinstance(action, Wait):
        # Brief pause so the page can finish loading; 1s is an arbitrary choice
        await asyncio.sleep(1.0)
        return "waited"
    if isinstance(action, GoBack):
        await browser.go_back()
        return "went back"
    if isinstance(action, Answer):
        return f"ANSWERED: {action.text}"
    raise ValueError(f"unknown action: {action}")


def _center(m: Mark) -> tuple[int, int]:
    x, y, w, h = m.bbox
    return x + w // 2, y + h // 2


def _get_mark(marks: list[Mark], mark_id: int) -> Mark:
    # Look up against the current observation only; stale IDs raise
    for m in marks:
        if m.mark_id == mark_id:
            return m
    raise ValueError(f"mark {mark_id} not in current observation")

The executor never trusts the mark IDs blindly — it looks them up against the current observation. If the model emits a stale mark ID (from an earlier screenshot), the lookup fails and we surface the error.
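A small browser-free sketch shows what a stale ID turns into. It reuses the lookup logic with a trimmed-down `Mark`; `get_mark` and `try_click` are hypothetical names for illustration, and the error text follows the format used in the executor and loop.

```python
from dataclasses import dataclass


@dataclass
class Mark:
    mark_id: int
    role: str
    name: str
    bbox: tuple


def get_mark(marks, mark_id):
    for m in marks:
        if m.mark_id == mark_id:
            return m
    raise ValueError(f"mark {mark_id} not in current observation")


def try_click(marks, mark_id) -> str:
    """Return the outcome string that would be appended to the history."""
    try:
        m = get_mark(marks, mark_id)
    except ValueError as e:
        return f"error: {e}"
    return f"clicked mark {mark_id} ({m.role}: {m.name!r})"


current = [
    Mark(0, "a", "Home", (0, 0, 50, 20)),
    Mark(1, "button", "Search", (60, 0, 80, 20)),
]
ok = try_click(current, 1)
stale = try_click(current, 7)  # ID remembered from an earlier screenshot
```

Because the error lands in the history as text, the model sees on the next turn that its target did not exist and can pick again from the fresh mark list.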

6. The outer loop#

loop.py
import base64
from google.adk.runners import Runner
from agent import build_web_agent
from browser import BrowserController
from grounding import extract_marks, take_annotated_screenshot, marks_to_text
from executor import execute
from actions import Answer


async def run_web_task(
    task: str,
    start_url: str,
    max_steps: int = 30,
    headless: bool = True,
) -> str:
    browser = BrowserController()
    await browser.start(headless=headless)
    await browser.goto(start_url)
    agent = build_web_agent()
    runner = Runner(agent=agent, app_name="web-voyager")
    history: list[str] = []
    try:
        for step in range(max_steps):
            # Fresh observation every iteration: marks first, then the screenshot
            marks = await extract_marks(browser.page)
            png = await take_annotated_screenshot(browser.page, marks)
            mark_text = marks_to_text(marks)
            user_message = _build_message(task, history, mark_text, png)
            result = await runner.arun(user_message)
            action = result.final_response  # parsed by output_schema
            print(f"step {step}: {action.action}: {action.reasoning[:80]}")
            try:
                outcome = await execute(action, browser, marks)
            except ValueError as e:
                outcome = f"error: {e}"
            history.append(f"step {step}: {action.action} -> {outcome}")
            if isinstance(action, Answer):
                return action.text
        return "max_steps reached without answer"
    finally:
        await browser.stop()


def _build_message(task, history, mark_text, png_bytes):
    return {
        "text": (
            f"TASK: {task}\n\n"
            f"HISTORY:\n" + ("\n".join(history) if history else "(none)") + "\n\n"
            f"{mark_text}"
        ),
        "image": {
            "mime_type": "image/png",
            # b64encode returns bytes; decode so the payload is a plain string
            "data": base64.b64encode(png_bytes).decode("ascii"),
        },
    }

The message contract is text + image. The text holds the task, the action history (a compact summary, not the full transcript), and the text mark-list. The image holds the annotated screenshot.

7. Run#

main.py
import asyncio
from loop import run_web_task
from dotenv import load_dotenv

load_dotenv()


async def main():
    result = await run_web_task(
        task="Find the title of the most recent paper on arxiv.org cs.CL",
        start_url="https://arxiv.org/list/cs.CL/recent",
        max_steps=15,
        headless=False,
    )
    print("FINAL:", result)


if __name__ == "__main__":
    asyncio.run(main())

For your first runs, set headless=False so you can watch the browser drive — it’s faster to debug visually than from text traces. Switch to headless for batch runs.

Code structure#

web-agent/
├── .env
├── pyproject.toml
├── actions.py      (pydantic action union)
├── browser.py      (Playwright controller)
├── grounding.py    (mark extraction + annotated screenshot)
├── executor.py     (action dispatch)
├── agent.py        (LlmAgent factory + system prompt)
├── loop.py         (run_web_task)
├── main.py         (entry point)
└── prompts/
    └── system.md   (mirror of SYSTEM_PROMPT)

Keeping each concern in its own module matters more here than in a text-only agent. The browser code knows about pixels and Playwright; the grounding code knows about marks and bboxes; the agent code knows about actions and reasoning. Crossing these boundaries is what causes web agents to break in subtle ways.

Loop control and exit conditions#

Three exit conditions for a web agent:

# already present in run_web_task
1. action == Answer → success, return text
2. step == max_steps → budget exhausted, return failure text
3. exception raised → safety net, tear down browser

A useful extra exit condition is stagnation detection: if the last K actions all targeted the same mark, or scrolled in the same direction without progress, abort. This catches the failure mode where the agent keeps clicking an element that does nothing or keeps scrolling past its target.

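# loop.py: stagnation check, placed after history.append(...) inside the step loop
if len(history) >= 4:
    # Drop the "step N: " prefix; what remains ("action -> outcome") names the mark or direction,
    # so four clicks on different marks do not count as stagnation
    recent = [h.split(": ", 1)[1] for h in history[-4:]]
    if len(set(recent)) == 1:
        print("stagnation: same action and target 4 times in a row; aborting")
        return "stuck"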

A wall-clock timeout matters as much as a step cap — a single page that takes 30s to settle can blow past the time budget without exceeding the step budget.
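One way to add a wall-clock budget alongside the step cap is to give every step only the time that remains. The following is a minimal sketch using asyncio alone; `run_with_deadline`, `budget_s`, and the two demo step functions are hypothetical, and a step returning a non-None value stands in for the agent emitting an answer.

```python
import asyncio
import time


async def run_with_deadline(step_fn, max_steps: int, budget_s: float) -> str:
    """Run step_fn(step) until it returns a result, steps run out, or time runs out."""
    deadline = time.monotonic() + budget_s
    for step in range(max_steps):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "time budget exhausted"
        try:
            # A single slow step cannot overrun the overall budget
            result = await asyncio.wait_for(step_fn(step), timeout=remaining)
        except asyncio.TimeoutError:
            return "time budget exhausted"
        if result is not None:
            return result
    return "max_steps reached without answer"


async def slow_step(step):
    await asyncio.sleep(0.05)  # stands in for a page that is slow to settle
    return None


async def fast_step(step):
    return "done" if step == 2 else None


async def demo():
    a = await run_with_deadline(slow_step, max_steps=100, budget_s=0.12)
    b = await run_with_deadline(fast_step, max_steps=100, budget_s=5.0)
    return a, b


outcomes = asyncio.run(demo())
```

In the real loop the step body would be one iteration of run_web_task, and the existing finally block would still tear down the browser when the deadline cancels a step.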

Common pitfalls#

The pitfalls that hit multimodal web-agent builders most often:

  • Stale screenshots. The agent acts on a screenshot taken before the previous action settled. Always re-screenshot at the start of each loop iteration.
  • Marks drawn off-viewport. Drawing a mark at y=2000 in an 800-tall viewport is invisible. Filter by viewport in the extractor.
  • Tiny marks. The visible number 7 over a 12x10 button is unreadable. Either ensure the mark label backdrop is large enough or pick a clearer font; the model literally needs to see the digit.
  • No wait action. Without wait, the agent fakes a wait by clicking somewhere useless. Add the action; the loop tightens up.
  • Mark ID drift. Extraction order can differ between turns, so mark 7 on one turn and mark 7 on the next may be different elements. Either log the role+name with the action history, or include them in the action’s parameters as a sanity check.
  • Ignoring cookie banners. Many real sites have a cookie modal that obscures content. The agent either has to dismiss it (its own first action) or you have to handle it in the browser controller. Add a pre-task “dismiss obvious cookie banners” step if you’re not measuring the agent’s ability to handle them.
  • No headful-mode debugging. When the agent fails, watching it run in a visible browser shows you why in seconds. Headless-first development is harder than it needs to be.
  • Trusting the model’s coordinate output. Don’t let the model output (x, y) pairs directly. The whole point of marks is to discretise the action space; once raw coordinates are allowed, the model starts hallucinating them.
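The sanity check suggested under mark ID drift could look like the sketch below. It assumes the action carries two extra fields, here called `expected_role` and `expected_name`, which are hypothetical and not part of the action models defined earlier; `check_target` is likewise an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class Mark:
    mark_id: int
    role: str
    name: str


def check_target(marks, mark_id, expected_role, expected_name):
    """Return the mark only if it still matches what the model says it is targeting."""
    by_id = {m.mark_id: m for m in marks}
    m = by_id.get(mark_id)
    if m is None:
        raise ValueError(f"mark {mark_id} not in current observation")
    if m.role != expected_role or m.name != expected_name:
        raise ValueError(
            f"mark {mark_id} is now <{m.role}> {m.name!r}, "
            f"expected <{expected_role}> {expected_name!r}"
        )
    return m


turn_1 = [Mark(7, "button", "Submit")]
turn_2 = [Mark(7, "a", "Privacy policy")]  # same ID, different element

first = check_target(turn_1, 7, "button", "Submit")
try:
    check_target(turn_2, 7, "button", "Submit")
    drift_error = None
except ValueError as e:
    drift_error = str(e)
```

Exact string matching is the strictest choice; a looser comparison (for example, role only) trades safety for fewer false alarms on pages whose labels change between turns.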
Why filter marks to the current viewport?

Two reasons. First, the model only sees the viewport in the screenshot — drawing marks on off-screen elements wastes mark IDs and the model can’t reason about elements it can’t see. Second, clicking an off-screen mark causes Playwright to either scroll-into-view (which can race with the page’s own scroll behaviour) or fail outright. Better to discipline the action surface: the agent can only act on what it can see, and if it needs to act on something below the fold, it must scroll first. The scroll-then-act pattern is what makes the loop predictable.
