Building a Multimodal Web Agent with ADK — Agentic

Goal

This writeup implements a WebVoyager-style multimodal web agent with ADK and Playwright. The case study — WebVoyager — Multimodal Web Agent — explains the architecture: a vision-language model in a ReAct loop, Set-of-Mark grounding via numbered overlays on screenshots, a small discrete action space, and an observation pipeline that survives lazy-loading and dynamic content.

You’ll end with a runnable agent that: launches a browser, navigates to a URL, takes a screenshot, overlays numbered marks on interactive elements, presents the annotated screenshot plus a text mark-list to a vision-language model, executes the model’s chosen action, and repeats until done. The implementation is faithful to the architecture but small enough to read in one sitting.

Prerequisites

Before starting:

WebVoyager — Multimodal Web Agent — the case study for context.
Setting Up and Grounding an Agent — the scaffold below assumes you’ve done a hello-world ADK project.
A vision-language model. Gemini 2.5 Pro, GPT-4o, or Claude 3.5 Sonnet — all support image input. Text-only models won’t work.
Playwright. uv add playwright && uv run playwright install chromium.
Pillow for image manipulation. uv add pillow.

Step-by-step

1. The action space

Start by pinning the action shapes. The agent’s output schema is the contract that holds the loop together.

from pydantic import BaseModel
from typing import Literal, Union

class Click(BaseModel):
    action: Literal["click"]
    mark_id: int
    reasoning: str

class Type(BaseModel):
    action: Literal["type"]
    mark_id: int
    text: str
    submit: bool = False
    reasoning: str

class Scroll(BaseModel):
    action: Literal["scroll"]
    direction: Literal["up", "down", "page_up", "page_down"]
    reasoning: str

class Wait(BaseModel):
    action: Literal["wait"]
    reasoning: str

class GoBack(BaseModel):
    action: Literal["go_back"]
    reasoning: str

class Answer(BaseModel):
    action: Literal["answer"]
    text: str
    reasoning: str

AgentAction = Union[Click, Type, Scroll, Wait, GoBack, Answer]

Six action types. Each is a small pydantic model with a literal discriminator, the parameters, and a reasoning field the model fills in. The literal action field is what we’ll use to dispatch in the executor.

The reasoning field serves two purposes: it forces the model to articulate its step, which tends to improve action quality, and it gives you a debuggable trace.
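To make the contract concrete, here is a minimal stdlib-only sketch of what dispatching on the action discriminator amounts to. It is a stand-in for the validation pydantic performs on the models above, not a replacement for it; REQUIRED_FIELDS and check_action are hypothetical names introduced for illustration.

```python
import json

# Required parameters per action, mirroring the pydantic models above.
# (`submit` on "type" has a default, so it is not required.)
REQUIRED_FIELDS = {
    "click": {"mark_id", "reasoning"},
    "type": {"mark_id", "text", "reasoning"},
    "scroll": {"direction", "reasoning"},
    "wait": {"reasoning"},
    "go_back": {"reasoning"},
    "answer": {"text", "reasoning"},
}


def check_action(raw: str) -> dict:
    """Parse a raw JSON action and check it against the contract."""
    data = json.loads(raw)
    kind = data.get("action")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {kind!r}")
    missing = REQUIRED_FIELDS[kind] - set(data)
    if missing:
        raise ValueError(f"{kind} is missing fields: {sorted(missing)}")
    return data


ok = check_action('{"action": "click", "mark_id": 7, "reasoning": "open the first result"}')
```

An action outside the six known values, or one missing a required parameter, is rejected before it can reach the executor. That is the property the schema is there to guarantee.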

2. The browser controller

A thin wrapper around Playwright that handles navigation, waiting, and tear-down.

from playwright.async_api import (
    Browser,
    Page,
    TimeoutError as PlaywrightTimeoutError,
    async_playwright,
)

class BrowserController:
    def __init__(self):
        self._pw = None
        self._browser: Browser | None = None
        self.page: Page | None = None

    async def start(self, headless: bool = True):
        self._pw = await async_playwright().start()
        self._browser = await self._pw.chromium.launch(headless=headless)
        ctx = await self._browser.new_context(
            viewport={"width": 1280, "height": 800},
            user_agent="Mozilla/5.0 ... AgentSDK/1.0",
        )
        self.page = await ctx.new_page()

    async def goto(self, url: str):
        await self.page.goto(url, wait_until="domcontentloaded")
        # Same capped wait as after actions: a page that never goes idle
        # should not abort the run with a timeout.
        await self._settle(timeout_ms=10_000)

    async def click_at(self, x: int, y: int):
        await self.page.mouse.click(x, y)
        await self._settle()

    async def type_at(self, x: int, y: int, text: str, submit: bool):
        # Click first to focus the field, then type into it
        await self.page.mouse.click(x, y)
        await self.page.keyboard.type(text)
        if submit:
            await self.page.keyboard.press("Enter")
        await self._settle()

    async def scroll(self, direction: str):
        # Pixel deltas; page_up/page_down move one viewport height (800px)
        delta = {"up": -300, "down": 300, "page_up": -800, "page_down": 800}[direction]
        await self.page.mouse.wheel(0, delta)
        await self._settle(short=True)

    async def go_back(self):
        await self.page.go_back()
        await self._settle()

    async def _settle(self, short: bool = False, timeout_ms: int | None = None):
        if timeout_ms is None:
            timeout_ms = 2000 if short else 5000
        try:
            await self.page.wait_for_load_state("networkidle", timeout=timeout_ms)
        except PlaywrightTimeoutError:
            pass  # some pages never go idle; that's fine

    async def stop(self):
        if self._browser:
            await self._browser.close()
        if self._pw:
            await self._pw.stop()

The _settle helper is the unglamorous load-bearing piece. After every action, wait for the page to quiet down before reading state again. The try/except around wait_for_load_state is there because some pages (especially those with long-polling or analytics beacons) never go fully idle; we cap the wait and move on.

3. Extract interactive elements and draw marks

This is the grounding step. After each action, take a screenshot, find interactives, and overlay numbered boxes.

from PIL import Image, ImageDraw, ImageFont
import io, base64
from dataclasses import dataclass

@dataclass
class Mark:
    mark_id: int
    role: str       # button, link, input, etc.
    name: str       # accessible name
    bbox: tuple[int, int, int, int]   # x, y, w, h


INTERACTIVE_SELECTOR = """
    a, button, input:not([type=hidden]),
    select, textarea, [role=button], [role=link],
    [role=textbox], [role=combobox], [role=checkbox],
    [role=radio], [tabindex]:not([tabindex='-1'])
"""

async def extract_marks(page) -> list[Mark]:
    """Find all interactive elements in the current viewport with usable bboxes."""
    # Run in-page JS to collect candidates with their bounding rects + role + name
    candidates = await page.evaluate(f"""
        () => {{
            const els = document.querySelectorAll(`{INTERACTIVE_SELECTOR}`);
            const out = [];
            for (const el of els) {{
                const r = el.getBoundingClientRect();
                if (r.width < 2 || r.height < 2) continue;
                if (r.bottom < 0 || r.top > window.innerHeight) continue;
                const style = window.getComputedStyle(el);
                if (style.visibility === 'hidden' || style.display === 'none') continue;
                out.push({{
                    role: el.getAttribute('role') || el.tagName.toLowerCase(),
                    name: (el.getAttribute('aria-label') || el.innerText || el.value || '').trim().slice(0, 60),
                    x: Math.round(r.left), y: Math.round(r.top),
                    w: Math.round(r.width), h: Math.round(r.height),
                }});
            }}
            return out;
        }}
    """)

    return [
        Mark(mark_id=i, role=c["role"], name=c["name"],
             bbox=(c["x"], c["y"], c["w"], c["h"]))
        for i, c in enumerate(candidates)
    ]


async def take_annotated_screenshot(page, marks: list[Mark]) -> bytes:
    """Take a viewport screenshot, draw numbered marks on each interactive element."""
    raw = await page.screenshot(type="png", full_page=False)
    img = Image.open(io.BytesIO(raw)).convert("RGB")
    draw = ImageDraw.Draw(img)

    for m in marks:
        x, y, w, h = m.bbox
        # Box around the element
        draw.rectangle([x, y, x + w, y + h], outline=(255, 64, 64), width=2)
        # Mark number, top-left, with a contrasting backdrop
        label = str(m.mark_id)
        tw, th = draw.textbbox((0, 0), label)[2:]
        # Clamp so labels for elements at the top or left edge stay inside the image
        bx0, by0 = max(0, x), max(0, y - th - 2)
        draw.rectangle([bx0, by0, bx0 + tw + 4, by0 + th + 2], fill=(255, 64, 64))
        draw.text((bx0 + 2, by0), label, fill=(255, 255, 255))

    out = io.BytesIO()
    img.save(out, format="PNG")
    return out.getvalue()


def marks_to_text(marks: list[Mark]) -> str:
    lines = ["Marks visible in the screenshot:"]
    for m in marks:
        name = m.name or "(no label)"
        lines.append(f"  {m.mark_id}: <{m.role}> {name}")
    return "\n".join(lines)

A few notes:

Filter by viewport. Off-screen elements aren’t useful to mark. We exclude anything outside the current viewport vertically.
Filter by visibility. display:none, visibility:hidden, and zero-size elements are skipped.
Truncate names to 60 chars. Long element names blow up the text mark-list. The model still sees the screenshot for the full context.
The bbox is the source of truth for clicks. When the agent picks mark_id=7, we click the centre of mark 7’s bbox.
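One refinement worth considering: the extractor keeps elements that are only partly inside the viewport, and for those the centre of the full bbox can fall outside the visible area. The sketch below clicks the centre of the visible portion instead. It is an optional variation, not part of the code above; visible_center is a hypothetical helper, and the viewport size is passed in explicitly.

```python
from typing import Optional, Tuple

BBox = Tuple[int, int, int, int]  # x, y, w, h in viewport coordinates


def visible_center(bbox: BBox, viewport_w: int, viewport_h: int) -> Optional[Tuple[int, int]]:
    """Centre of the part of bbox that lies inside the viewport, or None."""
    x, y, w, h = bbox
    # Intersect the element's rectangle with the viewport rectangle
    left, top = max(x, 0), max(y, 0)
    right, bottom = min(x + w, viewport_w), min(y + h, viewport_h)
    if right <= left or bottom <= top:
        return None  # nothing of the element is visible
    return (left + right) // 2, (top + bottom) // 2


# A button straddling the bottom edge of a 1280x800 viewport:
# the full-bbox centre would be at y=830, which is off-screen.
point = visible_center((100, 780, 200, 100), 1280, 800)
```

For a fully visible element this returns the same point as the plain bbox centre, so it can replace the centre computation without changing the common case.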

4. Build the agent

ADK supports multimodal input. The agent receives the task, the annotated screenshot, and the text mark-list each turn.

from google.adk.agents import LlmAgent
from pydantic import RootModel
from actions import AgentAction

SYSTEM_PROMPT = """You are a web agent that completes user tasks by
interacting with a browser.

Each turn you see:
- The task you are completing.
- An annotated screenshot of the current viewport. Interactive
  elements are outlined in red and numbered.
- A text list of those marks: id, role, accessible name.
- Your action history so far.

Output exactly one action, as structured JSON matching the schema.
Rules:
1. Refer to elements by mark_id. Do not output coordinates.
2. If the goal is reached, output `answer` with the final result.
3. If a page is loading or animating, output `wait`.
4. If you took a wrong action and need to recover, use `go_back`.
5. Always explain your choice in the `reasoning` field.
6. Prefer fewer, higher-confidence actions over many guesses."""


class AgentStep(RootModel[AgentAction]):
    """One action per turn; serialises as the bare action object."""


def build_web_agent() -> LlmAgent:
    return LlmAgent(
        name="web_voyager",          # ADK agent names must be valid Python identifiers
        model="gemini-2.5-pro",      # multimodal-capable
        instruction=SYSTEM_PROMPT,
        # Assumption: output_schema expects a pydantic BaseModel subclass, so the
        # union is wrapped in a RootModel. Check this against your ADK version.
        output_schema=AgentStep,
    )

The output_schema parameter tells ADK to enforce the action union — the model can’t emit free-form text that breaks the dispatch.

5. The executor: map action to browser call

from browser import BrowserController
from grounding import Mark
from actions import AgentAction, Click, Type, Scroll, Wait, GoBack, Answer

async def execute(
    action: AgentAction,
    browser: BrowserController,
    marks: list[Mark],
) -> str:
    """Execute the action; return a short text result for the next observation."""
    if isinstance(action, Click):
        m = _get_mark(marks, action.mark_id)
        cx, cy = _center(m)
        await browser.click_at(cx, cy)
        return f"clicked mark {action.mark_id} ({m.role}: {m.name!r})"

    if isinstance(action, Type):
        m = _get_mark(marks, action.mark_id)
        cx, cy = _center(m)
        await browser.type_at(cx, cy, action.text, submit=action.submit)
        return f"typed into mark {action.mark_id}: {action.text!r}"

    if isinstance(action, Scroll):
        await browser.scroll(action.direction)
        return f"scrolled {action.direction}"

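    if isinstance(action, Wait):
        # Actually pause; otherwise the next screenshot is taken immediately
        await browser.page.wait_for_timeout(1_000)
        return "waited 1s"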

    if isinstance(action, GoBack):
        await browser.go_back()
        return "went back"

    if isinstance(action, Answer):
        return f"ANSWERED: {action.text}"

    raise ValueError(f"unknown action: {action}")


def _center(m: Mark) -> tuple[int, int]:
    x, y, w, h = m.bbox
    return x + w // 2, y + h // 2


def _get_mark(marks: list[Mark], mark_id: int) -> Mark:
    for m in marks:
        if m.mark_id == mark_id:
            return m
    raise ValueError(f"mark {mark_id} not in current observation")

The executor never trusts the mark IDs blindly — it looks them up against the current observation. If the model emits a stale mark ID (from an earlier screenshot), the lookup fails and we surface the error.

6. The outer loop

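from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
from pydantic import TypeAdapter, ValidationError

from actions import AgentAction, Answer
from agent import build_web_agent
from browser import BrowserController
from executor import execute
from grounding import extract_marks, take_annotated_screenshot, marks_to_text

# The Runner and session calls below follow the ADK 1.x API (run_async,
# awaitable create_session); check them against your installed version.
APP_NAME = "web_voyager"
USER_ID = "local"
ACTION_ADAPTER = TypeAdapter(AgentAction)


async def run_web_task(
    task: str,
    start_url: str,
    max_steps: int = 30,
    headless: bool = True,
) -> str:
    browser = BrowserController()
    await browser.start(headless=headless)

    sessions = InMemorySessionService()
    runner = Runner(agent=build_web_agent(), app_name=APP_NAME, session_service=sessions)

    history: list[str] = []
    try:
        await browser.goto(start_url)  # inside try so a failed load still tears down
        for step in range(max_steps):
            marks = await extract_marks(browser.page)
            png = await take_annotated_screenshot(browser.page, marks)
            message = _build_message(task, history, marks_to_text(marks), png)

            # Fresh session per step: the compact history is already in the
            # message, so earlier screenshots are not re-sent to the model.
            session = await sessions.create_session(app_name=APP_NAME, user_id=USER_ID)
            reply = ""
            async for event in runner.run_async(
                user_id=USER_ID, session_id=session.id, new_message=message
            ):
                if event.is_final_response() and event.content and event.content.parts:
                    reply = event.content.parts[0].text or ""

            try:
                action = ACTION_ADAPTER.validate_json(reply)  # JSON -> action union
            except ValidationError as e:
                history.append(f"step {step}: invalid -> error: unparseable action ({e.error_count()} issues)")
                continue

            print(f"step {step}: {action.action} | {action.reasoning[:80]}")

            try:
                outcome = await execute(action, browser, marks)
            except ValueError as e:
                outcome = f"error: {e}"

            history.append(f"step {step}: {action.action} -> {outcome}")

            if isinstance(action, Answer):
                return action.text

        return "max_steps reached without answer"
    finally:
        await browser.stop()


def _build_message(task, history, mark_text, png_bytes) -> types.Content:
    text = (
        f"TASK: {task}\n\n"
        f"HISTORY:\n" + ("\n".join(history) if history else "(none)") + "\n\n"
        f"{mark_text}"
    )
    return types.Content(
        role="user",
        parts=[
            types.Part(text=text),
            types.Part.from_bytes(data=png_bytes, mime_type="image/png"),
        ],
    )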

The message contract is text + image. The text holds the task, the action history (a compact summary, not the full transcript), and the text mark-list. The image holds the annotated screenshot.
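On long runs the history itself can grow past what counts as compact. One way to bound it is to send only the most recent entries plus a count of what was dropped. This is a minimal sketch under that assumption; compact_history and keep_last are hypothetical names, not part of the loop above.

```python
from typing import List


def compact_history(history: List[str], keep_last: int = 8) -> str:
    """Render the last `keep_last` history entries, noting how many were dropped."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if not history:
        return "(none)"
    dropped = max(0, len(history) - keep_last)
    lines = []
    if dropped:
        lines.append(f"({dropped} earlier steps omitted)")
    lines.extend(history[-keep_last:])
    return "\n".join(lines)


demo = [f"step {i}: scroll -> scrolled down" for i in range(10)]
rendered = compact_history(demo, keep_last=3)
```

The rendered string could stand in for the joined history inside the message text. The trade-off is that the model loses sight of early steps, so the window should be at least as large as the stagnation check's look-back.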

7. Run

import asyncio
from loop import run_web_task
from dotenv import load_dotenv
load_dotenv()

async def main():
    result = await run_web_task(
        task="Find the title of the most recent paper on arxiv.org cs.CL",
        start_url="https://arxiv.org/list/cs.CL/recent",
        max_steps=15,
        headless=False,
    )
    print("FINAL:", result)

asyncio.run(main())

For your first runs, set headless=False so you can watch the browser drive — it’s faster to debug visually than from text traces. Switch to headless for batch runs.

Code structure

web-agent/
├── .env
├── pyproject.toml
├── actions.py            (pydantic action union)
├── browser.py            (Playwright controller)
├── grounding.py          (mark extraction + annotated screenshot)
├── executor.py           (action dispatch)
├── agent.py              (LlmAgent factory + system prompt)
├── loop.py               (run_web_task)
├── main.py               (entry point)
└── prompts/
    └── system.md         (mirror of SYSTEM_PROMPT)

Keeping each concern in its own module matters more here than in a text-only agent. The browser code knows about pixels and Playwright; the grounding code knows about marks and bboxes; the agent code knows about actions and reasoning. Crossing these boundaries is what causes web agents to break in subtle ways.

Loop control and exit conditions

Three exit conditions for a web agent:

# already present in run_web_task
1. action == Answer        → success, return text
2. step == max_steps       → budget exhausted, return failure text
3. exception raised        → safety net, tear down browser

A useful extra exit condition is stagnation detection — if the last K actions all targeted the same mark or scrolled in the same direction without progress, abort. This catches the failure mode where the agent gets stuck clicking a non-interactive button or scrolling past a target.

# loop.py — stagnation snippet
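if len(history) >= 4:
    # Drop the "step N: " prefix but keep action and outcome, so four clicks
    # on different marks do not count as repetition. Four scrolls in the same
    # direction do match; widen the window if long pages are expected.
    last_actions = [h.split(": ", 1)[1] for h in history[-4:]]
    if len(set(last_actions)) == 1:
        print("stagnation: same action and target 4 times in a row; aborting")
        return "stuck"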

A wall-clock timeout matters as much as a step cap — a single page that takes 30s to settle can blow past the time budget without exceeding the step budget.
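A wall-clock budget can be layered on from the outside with asyncio.wait_for, without touching the loop. In this minimal sketch, run_with_deadline is a hypothetical wrapper and slow_task is a stand-in for run_web_task. Its finally block plays the role of browser.stop() to show that cleanup still runs when the deadline fires.

```python
import asyncio


async def run_with_deadline(coro, seconds: float, on_timeout: str = "timed out"):
    """Await coro, but give up after `seconds` of wall-clock time."""
    try:
        # wait_for cancels the inner task on timeout and waits for it to unwind
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return on_timeout


cleaned_up = []


async def slow_task() -> str:
    try:
        await asyncio.sleep(1.0)  # stands in for a page that takes too long
        return "done"
    finally:
        cleaned_up.append(True)   # stands in for `await browser.stop()`


result = asyncio.run(run_with_deadline(slow_task(), seconds=0.05))
```

Because cancellation propagates into the wrapped coroutine, the try/finally in run_web_task would still tear the browser down when the budget is exceeded.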

Common pitfalls

The pitfalls that hit multimodal web-agent builders most often:

Stale screenshots. The agent acts on a screenshot taken before the previous action settled. Always re-screenshot at the start of each loop iteration.
Marks drawn off-viewport. Drawing a mark at y=2000 in an 800-tall viewport is invisible. Filter by viewport in the extractor.
Tiny marks. The visible number 7 over a 12x10 button is unreadable. Either ensure the mark label backdrop is large enough or pick a clearer font; the model literally needs to see the digit.
No wait action. Without wait, the agent fakes a wait by clicking somewhere useless. Add the action; the loop tightens up.
Mark ID drift. Mark extraction order can differ between turns, so mark 7 in the previous screenshot and mark 7 in the current one may be different elements. Either log the role+name with the action history, or include them in the action’s parameters as a sanity check.
Ignoring cookie banners. Many real sites have a cookie modal that obscures content. The agent either has to dismiss it (its own first action) or you have to handle it in the browser controller. Add a pre-task “dismiss obvious cookie banners” step if you’re not measuring the agent’s ability to handle them.
No headful-mode debugging. When the agent fails, watching it run in a visible browser shows you why in seconds. Headless-first development is harder than it needs to be.
Trusting the model’s coordinate output. Don’t let the model output (x, y) pairs directly. The whole point of marks is to discretise the action space; once raw coordinates are allowed, the model starts hallucinating them.
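The sanity check suggested under mark ID drift can be sketched as a stricter lookup. It assumes the action schema is extended with optional expected_role and expected_name fields, which the models in step 1 do not have. resolve_mark is a hypothetical replacement for _get_mark, and Mark is redeclared here only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Mark:
    mark_id: int
    role: str
    name: str
    bbox: Tuple[int, int, int, int]  # x, y, w, h


def resolve_mark(
    marks: List[Mark],
    mark_id: int,
    expected_role: Optional[str] = None,
    expected_name: Optional[str] = None,
) -> Mark:
    """Look up a mark and confirm it is still the element the model meant."""
    by_id = {m.mark_id: m for m in marks}
    if mark_id not in by_id:
        raise ValueError(f"mark {mark_id} not in current observation")
    m = by_id[mark_id]
    if expected_role is not None and m.role != expected_role:
        raise ValueError(f"mark {mark_id} is <{m.role}>, expected <{expected_role}>")
    # Case-insensitive substring match, since names are truncated to 60 chars
    if expected_name is not None and expected_name.lower() not in m.name.lower():
        raise ValueError(f"mark {mark_id} is named {m.name!r}, expected {expected_name!r}")
    return m


marks = [Mark(7, "button", "Search", (10, 10, 80, 24))]
found = resolve_mark(marks, 7, expected_role="button", expected_name="search")
```

A mismatch raises ValueError, which the loop already turns into an error outcome in the history, so the model sees the failure on its next turn instead of clicking the wrong element.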

Why filter marks to the current viewport?

Two reasons. First, the model only sees the viewport in the screenshot: marks on off-screen elements waste mark IDs, and the model can’t reason about elements it can’t see. Second, an off-screen click is unreliable. The controller above clicks by viewport coordinates with page.mouse.click, so a bbox outside the viewport has no valid target; a locator-based click would instead scroll the element into view, which can race with the page’s own scroll behaviour. It is better to constrain the action surface: the agent can only act on what it can see, and if it needs to act on something below the fold, it must scroll first. The scroll-then-act pattern is what makes the loop predictable.