Building a Multimodal Web Agent with ADK
Playwright state management, screenshot-and-DOM grounding tools, and a multimodal ReAct loop assembled with ADK primitives.
Goal#
This writeup implements a WebVoyager-style multimodal web agent with ADK and Playwright. The companion WebVoyager case study explains the architecture: a vision-language model in a ReAct loop, Set-of-Mark grounding via numbered overlays on screenshots, a small discrete action space, and an observation pipeline that survives lazy-loading and dynamic content.
You’ll end with a runnable agent that: launches a browser, navigates to a URL, takes a screenshot, overlays numbered marks on interactive elements, presents the annotated screenshot plus a text mark-list to a vision-language model, executes the model’s chosen action, and repeats until done. The implementation is faithful to the architecture but small enough to read in one sitting.
Prerequisites#
Before starting:
- WebVoyager — Multimodal Web Agent — the case study for context.
- Setting Up and Grounding an Agent — the scaffold below assumes you’ve done a hello-world ADK project.
- A vision-language model. Gemini 2.5 Pro, GPT-4o, or Claude 3.5 Sonnet — all support image input. Text-only models won’t work.
- Playwright. Install with: uv add playwright && uv run playwright install chromium
- Pillow for image manipulation. Install with: uv add pillow
Step-by-step#
1. The action space#
Start by pinning the action shapes. The agent’s output schema is the contract that holds the loop together.
    # actions.py
    from typing import Literal, Union

    from pydantic import BaseModel


    class Click(BaseModel):
        action: Literal["click"]
        mark_id: int
        reasoning: str


    class Type(BaseModel):
        action: Literal["type"]
        mark_id: int
        text: str
        submit: bool = False
        reasoning: str


    class Scroll(BaseModel):
        action: Literal["scroll"]
        direction: Literal["up", "down", "page_up", "page_down"]
        reasoning: str


    class Wait(BaseModel):
        action: Literal["wait"]
        reasoning: str


    class GoBack(BaseModel):
        action: Literal["go_back"]
        reasoning: str


    class Answer(BaseModel):
        action: Literal["answer"]
        text: str
        reasoning: str


    AgentAction = Union[Click, Type, Scroll, Wait, GoBack, Answer]

Six action types. Each is a small pydantic model with a literal discriminator, the parameters, and a reasoning field the model fills in. The literal action field is what we’ll use to dispatch in the executor.
The reasoning field serves two purposes: it forces the model to articulate its step, which tends to improve action quality, and it gives you a debuggable trace.
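To see what the literal discriminator buys you, here is a dependency-free sketch of the same contract. It is an illustration only: REQUIRED_FIELDS and parse_action are hypothetical names, and in the agent itself pydantic performs this validation.

```python
import json

# Hypothetical table mirroring the six action shapes: discriminator -> required fields.
REQUIRED_FIELDS = {
    "click": {"mark_id", "reasoning"},
    "type": {"mark_id", "text", "reasoning"},
    "scroll": {"direction", "reasoning"},
    "wait": {"reasoning"},
    "go_back": {"reasoning"},
    "answer": {"text", "reasoning"},
}


def parse_action(raw: str) -> dict:
    """Parse model output and reject anything outside the action contract."""
    data = json.loads(raw)
    kind = data.get("action")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {kind!r}")
    missing = REQUIRED_FIELDS[kind] - set(data)
    if missing:
        raise ValueError(f"{kind} is missing fields: {sorted(missing)}")
    return data
```

A well-formed click passes through unchanged; a made-up action name or a click without a mark_id is rejected before it can reach the browser.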
2. The browser controller#
A thin wrapper around Playwright that handles navigation, waiting, and tear-down.
    # browser.py
    from playwright.async_api import Browser, Page, async_playwright
    from playwright.async_api import TimeoutError as PlaywrightTimeoutError


    class BrowserController:
        def __init__(self):
            self._pw = None
            self._browser: Browser | None = None
            self.page: Page | None = None

        async def start(self, headless: bool = True):
            self._pw = await async_playwright().start()
            self._browser = await self._pw.chromium.launch(headless=headless)
            ctx = await self._browser.new_context(
                viewport={"width": 1280, "height": 800},
                user_agent="Mozilla/5.0 ... AgentSDK/1.0",
            )
            self.page = await ctx.new_page()

        async def goto(self, url: str):
            await self.page.goto(url, wait_until="domcontentloaded")
            try:
                await self.page.wait_for_load_state("networkidle", timeout=10_000)
            except PlaywrightTimeoutError:
                pass  # same cap-and-move-on policy as _settle

        async def click_at(self, x: int, y: int):
            await self.page.mouse.click(x, y)
            await self._settle()

        async def type_at(self, x: int, y: int, text: str, submit: bool):
            await self.page.mouse.click(x, y)  # focus the field first
            await self.page.keyboard.type(text)
            if submit:
                await self.page.keyboard.press("Enter")
            await self._settle()

        async def scroll(self, direction: str):
            delta = {"up": -300, "down": 300, "page_up": -800, "page_down": 800}[direction]
            await self.page.mouse.wheel(0, delta)
            await self._settle(short=True)

        async def go_back(self):
            await self.page.go_back()
            await self._settle()

        async def _settle(self, short: bool = False):
            timeout = 2000 if short else 5000  # milliseconds
            try:
                await self.page.wait_for_load_state("networkidle", timeout=timeout)
            except PlaywrightTimeoutError:
                pass  # some pages never go idle; that's fine

        async def stop(self):
            if self._browser:
                await self._browser.close()
            if self._pw:
                await self._pw.stop()

The _settle helper is the unglamorous load-bearing piece. After every action, wait for the page to quiet down before reading state again. The try/except around wait_for_load_state exists because some pages (especially those with long-polling or analytics beacons) never go fully idle; we cap the wait and move on. The initial navigation in goto follows the same policy, so a page that never goes idle cannot crash the run before the first screenshot.
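The cap-and-move-on policy can be sketched without a browser. This is a minimal illustration, not the controller's code: settle, quick_page, and chatty_page are hypothetical names, and asyncio.wait_for stands in for Playwright's own timeout.

```python
import asyncio


async def settle(wait_for_idle, timeout_s: float) -> bool:
    """Await wait_for_idle() for at most timeout_s seconds.

    Returns True if the page went idle, False if the cap was hit.
    Hypothetical helper that mirrors the policy of _settle.
    """
    try:
        await asyncio.wait_for(wait_for_idle(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False  # never went idle; carry on with the state we have


async def quick_page():
    await asyncio.sleep(0.01)  # stand-in for a page that settles quickly


async def chatty_page():
    await asyncio.sleep(60)  # stand-in for long-polling that never settles


async def demo():
    return (await settle(quick_page, 0.5), await settle(chatty_page, 0.05))
```

Returning a boolean instead of silently passing is a small design choice: the loop could log how often the cap is hit, which is a useful signal when tuning the timeouts.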
3. Extract interactive elements and draw marks#
This is the grounding step. After each action, take a screenshot, find interactives, and overlay numbered boxes.
    # grounding.py
    import io
    from dataclasses import dataclass

    from PIL import Image, ImageDraw


    @dataclass
    class Mark:
        mark_id: int
        role: str                        # button, link, input, etc.
        name: str                        # accessible name
        bbox: tuple[int, int, int, int]  # x, y, w, h


    INTERACTIVE_SELECTOR = """
      a, button, input:not([type=hidden]), select, textarea,
      [role=button], [role=link], [role=textbox], [role=combobox],
      [role=checkbox], [role=radio], [tabindex]:not([tabindex='-1'])
    """


    async def extract_marks(page) -> list[Mark]:
        """Find all interactive elements in the current viewport with usable bboxes."""
        # Run in-page JS to collect candidates with their bounding rects + role + name
        candidates = await page.evaluate(f"""
          () => {{
            const els = document.querySelectorAll(`{INTERACTIVE_SELECTOR}`);
            const out = [];
            for (const el of els) {{
              const r = el.getBoundingClientRect();
              if (r.width < 2 || r.height < 2) continue;
              if (r.bottom < 0 || r.top > window.innerHeight) continue;
              const style = window.getComputedStyle(el);
              if (style.visibility === 'hidden' || style.display === 'none') continue;
              out.push({{
                role: el.getAttribute('role') || el.tagName.toLowerCase(),
                name: (el.getAttribute('aria-label') || el.innerText || el.value || '')
                        .trim().slice(0, 60),
                x: Math.round(r.left), y: Math.round(r.top),
                w: Math.round(r.width), h: Math.round(r.height),
              }});
            }}
            return out;
          }}
        """)

        return [
            Mark(mark_id=i, role=c["role"], name=c["name"],
                 bbox=(c["x"], c["y"], c["w"], c["h"]))
            for i, c in enumerate(candidates)
        ]


    async def take_annotated_screenshot(page, marks: list[Mark]) -> bytes:
        """Take a viewport screenshot, draw numbered marks on each interactive element."""
        raw = await page.screenshot(type="png", full_page=False)
        img = Image.open(io.BytesIO(raw)).convert("RGB")
        draw = ImageDraw.Draw(img)

        for m in marks:
            x, y, w, h = m.bbox
            # Box around the element
            draw.rectangle([x, y, x + w, y + h], outline=(255, 64, 64), width=2)
            # Mark number, top-left, with a contrasting backdrop
            label = str(m.mark_id)
            tw, th = draw.textbbox((0, 0), label)[2:]
            # Clamp so labels for elements at the very top stay inside the image
            bx0, by0 = x, max(0, y - th - 2)
            draw.rectangle([bx0, by0, bx0 + tw + 4, by0 + th + 2], fill=(255, 64, 64))
            draw.text((bx0 + 2, by0), label, fill=(255, 255, 255))

        out = io.BytesIO()
        img.save(out, format="PNG")
        return out.getvalue()


    def marks_to_text(marks: list[Mark]) -> str:
        lines = ["Marks visible in the screenshot:"]
        for m in marks:
            name = m.name or "(no label)"
            lines.append(f"  {m.mark_id}: <{m.role}> {name}")
        return "\n".join(lines)

A few notes:
- Filter by viewport. Off-screen elements aren’t useful to mark. We exclude anything outside the current viewport vertically.
- Filter by visibility. display:none, visibility:hidden, and zero-size elements are skipped.
- Truncate names to 60 chars. Long element names blow up the text mark-list. The model sees the screenshot for the full context.
- The bbox is the source of truth for clicks. When the agent picks mark_id=7, we click the centre of mark 7’s bbox.
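The filtering rules and the centre-click rule above can be restated in plain Python. This is a sketch under assumptions: Rect, is_markable, and click_point are hypothetical names, and the real filter runs as in-page JavaScript rather than in Python.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: int
    y: int
    w: int
    h: int


def is_markable(r: Rect, viewport_h: int, visible: bool = True) -> bool:
    """Apply the visibility, size, and vertical-viewport filters to one element."""
    if not visible:
        return False  # display:none or visibility:hidden
    if r.w < 2 or r.h < 2:
        return False  # effectively zero-size
    if r.y + r.h < 0 or r.y > viewport_h:
        return False  # entirely above or below the viewport
    return True


def click_point(r: Rect) -> tuple[int, int]:
    """Centre of the bbox, which is where a click on this mark lands."""
    return r.x + r.w // 2, r.y + r.h // 2
```

For an 80x30 button at (10, 100) in an 800-tall viewport, is_markable returns True and click_point returns (50, 115); the same button at y=2000 is filtered out.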
4. Build the agent#
ADK supports multimodal input. The agent receives the task, the annotated screenshot, and the text mark-list each turn.
    # agent.py
    from google.adk.agents import LlmAgent
    from pydantic import RootModel

    from actions import AgentAction

    SYSTEM_PROMPT = """You are a web agent that completes user tasks by
    interacting with a browser.

    Each turn you see:
    - The task you are completing.
    - An annotated screenshot of the current viewport. Interactive elements
      are outlined in red and numbered.
    - A text list of those marks: id, role, accessible name.
    - Your action history so far.

    Output exactly one action, as structured JSON matching the schema.

    Rules:
    1. Refer to elements by mark_id. Do not output coordinates.
    2. If the goal is reached, output `answer` with the final result.
    3. If a page is loading or animating, output `wait`.
    4. If you took a wrong action and need to recover, use `go_back`.
    5. Always explain your choice in the `reasoning` field.
    6. Prefer fewer, higher-confidence actions over many guesses.
    """


    class ActionEnvelope(RootModel[AgentAction]):
        """Wraps the union so it can be passed as a single pydantic model.

        Assumption: output_schema expects a pydantic model class rather than a
        bare Union. Check that your ADK and model versions accept a union
        schema; if not, collapse the actions into one model with optional fields.
        """


    def build_web_agent() -> LlmAgent:
        return LlmAgent(
            name="web_voyager",       # ADK agent names must be valid Python identifiers
            model="gemini-2.5-pro",   # multimodal-capable
            instruction=SYSTEM_PROMPT,
            output_schema=ActionEnvelope,
        )

The output_schema parameter tells ADK to enforce the action union, so the model can’t emit free-form text that breaks the dispatch.
5. The executor — map action to browser call#
    # executor.py
    import asyncio

    from actions import AgentAction, Answer, Click, GoBack, Scroll, Type, Wait
    from browser import BrowserController
    from grounding import Mark


    async def execute(
        action: AgentAction,
        browser: BrowserController,
        marks: list[Mark],
    ) -> str:
        """Execute the action; return a short text result for the next observation."""
        if isinstance(action, Click):
            m = _get_mark(marks, action.mark_id)
            cx, cy = _center(m)
            await browser.click_at(cx, cy)
            return f"clicked mark {action.mark_id} ({m.role}: {m.name!r})"

        if isinstance(action, Type):
            m = _get_mark(marks, action.mark_id)
            cx, cy = _center(m)
            await browser.type_at(cx, cy, action.text, submit=action.submit)
            return f"typed into mark {action.mark_id}: {action.text!r}"

        if isinstance(action, Scroll):
            await browser.scroll(action.direction)
            return f"scrolled {action.direction}"

        if isinstance(action, Wait):
            # Actually pause, so the next screenshot can differ from this one
            await asyncio.sleep(1.0)
            return "waited"

        if isinstance(action, GoBack):
            await browser.go_back()
            return "went back"

        if isinstance(action, Answer):
            return f"ANSWERED: {action.text}"

        raise ValueError(f"unknown action: {action}")


    def _center(m: Mark) -> tuple[int, int]:
        x, y, w, h = m.bbox
        return x + w // 2, y + h // 2


    def _get_mark(marks: list[Mark], mark_id: int) -> Mark:
        for m in marks:
            if m.mark_id == mark_id:
                return m
        raise ValueError(f"mark {mark_id} not in current observation")

The executor never trusts the mark IDs blindly: it looks them up against the current observation. If the model emits a mark ID that does not exist in the current screenshot, the lookup fails and we surface the error. Note the limit of this check: IDs are reassigned from 0 on every turn, so a stale ID that is still in range resolves to whatever element now carries that number (see "Mark ID drift" under Common pitfalls).
6. The outer loop#
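    # loop.py
    # Written against the ADK 1.x Runner API (run_async plus a session service);
    # adjust the calls if your ADK version differs.
    from google.adk.runners import Runner
    from google.adk.sessions import InMemorySessionService
    from google.genai import types
    from pydantic import TypeAdapter

    from actions import AgentAction, Answer
    from agent import build_web_agent
    from browser import BrowserController
    from executor import execute
    from grounding import extract_marks, marks_to_text, take_annotated_screenshot

    APP_NAME = "web-voyager"
    USER_ID = "local"
    _ACTION = TypeAdapter(AgentAction)


    async def run_web_task(
        task: str,
        start_url: str,
        max_steps: int = 30,
        headless: bool = True,
    ) -> str:
        browser = BrowserController()
        history: list[str] = []
        try:
            await browser.start(headless=headless)
            await browser.goto(start_url)

            sessions = InMemorySessionService()
            runner = Runner(agent=build_web_agent(), app_name=APP_NAME,
                            session_service=sessions)

            for step in range(max_steps):
                marks = await extract_marks(browser.page)
                png = await take_annotated_screenshot(browser.page, marks)
                message = _build_message(task, history, marks_to_text(marks), png)

                # Fresh session per step: the HISTORY text carries the context,
                # so earlier screenshots are not replayed to the model.
                session = await sessions.create_session(app_name=APP_NAME, user_id=USER_ID)
                action = await _next_action(runner, session.id, message)

                print(f"step {step}: {action.action} | {action.reasoning[:80]}")

                try:
                    outcome = await execute(action, browser, marks)
                except ValueError as e:
                    outcome = f"error: {e}"

                history.append(f"step {step}: {action.action} -> {outcome}")

                if isinstance(action, Answer):
                    return action.text

            return "max_steps reached without answer"
        finally:
            await browser.stop()


    async def _next_action(runner: Runner, session_id: str, message: types.Content) -> AgentAction:
        """Run one agent turn and parse the final JSON response into an action."""
        text = ""
        async for event in runner.run_async(
            user_id=USER_ID, session_id=session_id, new_message=message
        ):
            if event.is_final_response() and event.content and event.content.parts:
                text = event.content.parts[0].text or ""
        return _ACTION.validate_json(text)


    def _build_message(task, history, mark_text, png_bytes) -> types.Content:
        text = (
            f"TASK: {task}\n\n"
            "HISTORY:\n" + ("\n".join(history) if history else "(none)") + "\n\n"
            f"{mark_text}"
        )
        return types.Content(
            role="user",
            parts=[
                types.Part.from_text(text=text),
                types.Part.from_bytes(data=png_bytes, mime_type="image/png"),
            ],
        )

The message contract is text + image. The text holds the task, the action history (a compact summary, not the full transcript), and the text mark-list. The image holds the annotated screenshot.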
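On long tasks the history itself can grow large. One way to keep it a compact summary is to cap it at the most recent entries; the sketch below assumes a hypothetical compact_history helper and an arbitrary default of 8 entries, neither of which is part of the original loop.

```python
def compact_history(history: list[str], keep_last: int = 8) -> str:
    """Render the action history, keeping only the most recent entries.

    Hypothetical helper; the cap of 8 is an arbitrary illustrative default.
    """
    if not history:
        return "(none)"
    if len(history) <= keep_last:
        return "\n".join(history)
    dropped = len(history) - keep_last
    # Tell the model that steps were elided rather than silently dropping them
    return "\n".join([f"... {dropped} earlier steps omitted ...", *history[-keep_last:]])
```

The result could be substituted for the plain join in the HISTORY section of the message; whether a hard cap or a model-written summary works better depends on the task.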
7. Run#
    # main.py
    import asyncio

    from dotenv import load_dotenv

    from loop import run_web_task

    load_dotenv()


    async def main():
        result = await run_web_task(
            task="Find the title of the most recent paper on arxiv.org cs.CL",
            start_url="https://arxiv.org/list/cs.CL/recent",
            max_steps=15,
            headless=False,
        )
        print("FINAL:", result)


    if __name__ == "__main__":
        asyncio.run(main())

For your first runs, set headless=False so you can watch the browser drive; it’s faster to debug visually than from text traces. Switch to headless for batch runs.
Code structure#
    web-agent/
    ├── .env
    ├── pyproject.toml
    ├── actions.py        (pydantic action union)
    ├── browser.py        (Playwright controller)
    ├── grounding.py      (mark extraction + annotated screenshot)
    ├── executor.py       (action dispatch)
    ├── agent.py          (LlmAgent factory + system prompt)
    ├── loop.py           (run_web_task)
    ├── main.py           (entry point)
    └── prompts/
        └── system.md     (mirror of SYSTEM_PROMPT)

A clean separation of which module is grounded in what matters more here than in a text-only agent. The browser code knows about pixels and Playwright; the grounding code knows about marks and bboxes; the agent code knows about actions and reasoning. Crossing these boundaries is what causes web agents to break in subtle ways.
Loop control and exit conditions#
Three exit conditions for a web agent:
    # already present in run_web_task
    1. action == Answer   → success, return text
    2. step == max_steps  → budget exhausted, return failure text
    3. exception raised   → safety net, tear down browser

A useful extra exit condition is stagnation detection: if the last K actions all targeted the same mark or scrolled in the same direction without progress, abort. This catches the failure mode where the agent gets stuck clicking a non-interactive button or scrolling past a target.

    # loop.py: stagnation snippet (inside the step loop, after history.append)
    if len(history) >= 4:
        # Drop the "step N: " prefix so identical action+outcome pairs compare
        # equal. Comparing only the action name would flag four clicks on four
        # different marks as stagnation.
        last_steps = [h.split(": ", 1)[1] for h in history[-4:]]
        if len(set(last_steps)) == 1:
            print("stagnation: same action and outcome 4 times in a row; aborting")
            return "stuck"

A wall-clock timeout matters as much as a step cap: a single page that takes 30s to settle can blow past the time budget without exceeding the step budget.
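A wall-clock budget can be tracked with a small deadline object. This is a minimal sketch, assuming a hypothetical Deadline class; the injectable clock parameter exists only so the behaviour can be checked without real waiting.

```python
import time


class Deadline:
    """Wall-clock budget for a whole task (hypothetical helper)."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self) -> bool:
        return self._clock() >= self._expires_at
```

In run_web_task, one option is to create the deadline before the loop and check expired() at the top of each iteration, returning a failure string in the same style as the max_steps case. time.monotonic is used rather than time.time so that system clock adjustments cannot shorten or extend the budget.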
Common pitfalls#
The pitfalls that hit multimodal web-agent builders most often:
- Stale screenshots. The agent acts on a screenshot taken before the previous action settled. Always re-screenshot at the start of each loop iteration.
- Marks drawn off-viewport. Drawing a mark at y=2000 in an 800-tall viewport is invisible. Filter by viewport in the extractor.
- Tiny marks. The visible number 7 over a 12x10 button is unreadable. Either ensure the mark label backdrop is large enough or pick a clearer font; the model literally needs to see the digit.
- No wait action. Without wait, the agent fakes a wait by clicking somewhere useless. Add the action; the loop tightens up.
- Mark ID drift. Marks are renumbered on every extraction, so mark 7 on one turn and mark 7 on the next can be different elements. Either log the role+name with the action history, or include them in the action’s parameters as a sanity check.
- Ignoring cookie banners. Many real sites have a cookie modal that obscures content. The agent either has to dismiss it (its own first action) or you have to handle it in the browser controller. Add a pre-task “dismiss obvious cookie banners” step if you’re not measuring the agent’s ability to handle them.
- No headful-mode debugging. When the agent fails, watching it run in a visible browser shows you why in seconds. Headless-first development is harder than it needs to be.
- Trusting the model’s coordinate output. Don’t let the model output (x, y) pairs directly. The whole point of marks is to discretise the action space; let raw coordinates back in and the model starts hallucinating them.
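The sanity check suggested under "Mark ID drift" can be sketched as a stricter lookup. The names here are assumptions for illustration: resolve_mark is hypothetical, the Mark class is trimmed to the fields the check needs, and expected_role and expected_name would have to be added to the action schema for the model to fill in.

```python
from dataclasses import dataclass


@dataclass
class Mark:
    mark_id: int
    role: str
    name: str


def resolve_mark(marks: list[Mark], mark_id: int,
                 expected_role: str, expected_name: str) -> Mark:
    """Look up a mark and confirm it is still the element the model meant."""
    by_id = {m.mark_id: m for m in marks}
    m = by_id.get(mark_id)
    if m is None:
        raise ValueError(f"mark {mark_id} not in current observation")
    if (m.role, m.name) != (expected_role, expected_name):
        raise ValueError(
            f"mark {mark_id} is now <{m.role}> {m.name!r}, "
            f"expected <{expected_role}> {expected_name!r}"
        )
    return m
```

Raising ValueError keeps the failure on the same path as an unknown mark ID, so the loop can report it back to the model as an error outcome instead of clicking the wrong element.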
Why filter marks to the current viewport?
Two reasons. First, the model only sees the viewport in the screenshot — drawing marks on off-screen elements wastes mark IDs and the model can’t reason about elements it can’t see. Second, clicking an off-screen mark causes Playwright to either scroll-into-view (which can race with the page’s own scroll behaviour) or fail outright. Better to discipline the action surface: the agent can only act on what it can see, and if it needs to act on something below the fold, it must scroll first. The scroll-then-act pattern is what makes the loop predictable.
Related implementations#