Perception and Grounding

How agents take input — text, vision, audio, structured data — and ground it to actions in their environment.

Concept Foundational
8 min read
perception grounding multimodal observations

Summary

Perception is how an agent takes in the world: text, screenshots, audio, sensor streams, structured payloads. Grounding is the harder problem of mapping that perception onto the actions the agent can actually take — turning a pixel at position (x, y) into “click the submit button”, or turning a JSON blob into “the user wants order #4421 refunded”.

A model can be brilliant at reasoning and still fail catastrophically if perception is noisy or grounding is wrong. The agent doesn’t see the environment directly; it sees whatever the runtime feeds in. That makes the perception pipeline — what gets captured, how it’s encoded, how it’s referenced in the prompt — load-bearing in a way that’s easy to underestimate.

Why it matters

Three reasons perception-and-grounding is where most “real-world agent” projects go off the rails:

  • Reasoning is only as good as observations. A web agent that gets a fuzzy screenshot and a flattened DOM blob will misclick. A diagnosis agent that gets a free-text summary instead of the raw labs will miss findings. Bad perception silently degrades every downstream step.
  • Grounding errors are silent. If the agent picks the wrong object (clicks the wrong row, edits the wrong file, queries the wrong table), the rest of the trajectory looks coherent but is solving the wrong problem. End-state grading catches this only after the full trajectory has run, and it does not say which step went wrong.
  • Multimodality changes the system, not just the prompt. Adding vision isn’t a one-line config change: it affects latency budgets, cost shape, prompt structure, and the eval harness. Treating it as “the same agent with screenshots” is how you end up with a 10x cost regression in production.

Done right, perception is a tight pipeline with clear contracts: the runtime knows what it captured, what format the model expects, and how the model’s outputs map back onto the world.

How it works

Modalities and their encodings

  • Text. The native modality. Easy to include in the prompt, easy to log, easy to diff. The main work is structuring the text — JSON for API responses, markdown for human-facing content, line-numbered code for diffs.
  • Vision. Screenshots, photos, charts, document scans. Sent to vision-capable models as base64-encoded image data or as image URLs. Two pitfalls: resolution (downsampling drops detail the agent then can’t see) and reference (the model can describe what’s in the image but can’t act on coordinates unless the tool surface accepts them).
  • Audio. Speech and ambient sound. Usually transcribed to text via a speech model before being handed to the agent, because support for reasoning directly over raw audio is still limited. Transcription loses prosody but gains debuggability.
  • Structured data. Database rows, API payloads, telemetry. Almost always converted to a textual representation (JSON or a markdown table) before going into the prompt. The conversion is where ambiguity sneaks in: the same date rendered as "2026-05-17" in one place and "May 17, 2026" in another can lead the model to reason differently, so pick one canonical format per type.
  • Sensor / continuous. IoT telemetry, video, robot proprioception. Usually summarised by an upstream model or rules engine into discrete events the agent can reason over: “motion detected at 14:02”, not 1080p frames.
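As a minimal sketch of the structured-data point above, the snippet below serialises a row to JSON with one canonical form per type, so dates always appear as ISO 8601 strings. The function name `encode_row` and the sample row are illustrative assumptions, not part of any particular agent framework.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def encode_row(row: dict) -> str:
    """Serialise a row to JSON using one canonical text form per type."""

    def to_canonical(value):
        # Dates and datetimes always become ISO 8601 strings.
        if isinstance(value, (date, datetime)):
            return value.isoformat()
        # Decimals become strings so no precision is lost in float conversion.
        if isinstance(value, Decimal):
            return str(value)
        raise TypeError(f"no canonical encoding for {type(value).__name__}")

    # sort_keys keeps the output stable across runs, which makes logs diffable.
    return json.dumps(row, default=to_canonical, sort_keys=True, ensure_ascii=False)


# Hypothetical row, for illustration only.
row = {"order_id": 4421, "placed_on": date(2026, 5, 17), "total": Decimal("19.90")}
encoded = encode_row(row)
print(encoded)
```

Raising on unknown types, rather than falling back to `str()`, is a deliberate choice here: an unexpected type becomes a visible pipeline error instead of an ambiguous string in the prompt.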

Grounding mechanisms

Grounding is the bridge between what the model sees and what the agent does. Three common mechanisms:

  1. Set-of-Marks (SoM) for vision. Overlay numbered bounding boxes on the screenshot before sending it to the model. The model picks an action like click(mark=7) instead of click(x=412, y=583). This translates a pixel-grounding problem into a discrete-choice problem the LLM handles better. WebVoyager-style web agents lean on this heavily.
  2. DOM / accessibility-tree handles. For browser agents, supplement the screenshot with a structured tree of interactive elements, each with a stable ID. The agent emits click(id=button-submit-42). Robust to layout changes; brittle to dynamically rendered IDs.
  3. Entity references in text/structured input. Don’t ask the agent to refer to “the user named John from earlier”; assign IDs (user_id=4421) on ingestion and require the agent to use them. Eliminates the “which one did you mean” failure mode that string-matching introduces.
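The runtime side of Set-of-Marks can be sketched as follows: number the interactive elements, keep a table from mark index to bounding box, and translate the model's `click(mark=N)` back into pixel coordinates. The `Mark` type, the helper names, and the element dictionaries are assumptions made for illustration; drawing the overlay on the screenshot is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Mark:
    index: int
    element_id: str
    box: Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def assign_marks(elements: List[dict]) -> Dict[int, Mark]:
    """Number elements in reading order: top to bottom, then left to right."""
    ordered = sorted(elements, key=lambda e: (e["box"][1], e["box"][0]))
    return {
        i: Mark(i, e["id"], tuple(e["box"]))
        for i, e in enumerate(ordered, start=1)
    }


def resolve_click(marks: Dict[int, Mark], mark_index: int) -> Tuple[int, int]:
    """Translate click(mark=N) into the pixel centre of the marked box."""
    if mark_index not in marks:
        # An unknown mark is rejected instead of guessed at.
        raise KeyError(f"unknown mark {mark_index}")
    left, top, width, height = marks[mark_index].box
    return (left + width // 2, top + height // 2)


# Hypothetical elements, for illustration only.
elements = [
    {"id": "button-submit-42", "box": (400, 570, 24, 26)},
    {"id": "input-email", "box": (100, 200, 300, 40)},
]
marks = assign_marks(elements)
print(resolve_click(marks, 2))
```

The model only ever chooses among the integers the runtime issued, so an out-of-range choice fails loudly at the tool boundary instead of turning into a misclick.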

The perception pipeline

A robust perception pipeline has four stages:

  1. Capture. Take a screenshot, scrape a page, pull a row, transcribe audio. Captures are time-bound: a missing or stale capture is a real failure mode.
  2. Normalise. Resize images, sanitise HTML, strip PII, redact secrets, convert timestamps. Deterministic, ideally pure-function.
  3. Encode. Format for the model — JSON, markdown table, base64 image, line-numbered code block. This is where the format-vs-content trade-off lives.
  4. Reference. Decide how the agent will refer back to objects in the observation when it acts — Set-of-Marks indices, IDs, line numbers, JSON pointers.

The contract between stage 3 (Encode) and stage 4 (Reference) is what makes grounding work. If the encoding doesn’t expose stable references, the agent has no choice but to free-text its way back to the world, and free-text grounding is exactly where it goes wrong.
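The four stages can be sketched for a structured-data input as below. Everything here is a hypothetical stand-in: `capture` returns hard-coded rows instead of calling an API, the e-mail regex is a deliberately rough redaction rule, and the `user_id=...` handle format is an assumption chosen to mirror the entity-reference example earlier.

```python
import json
import re
from dataclasses import dataclass

# Rough e-mail pattern, good enough for a sketch; not a full validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Observation:
    prompt_text: str  # what the model sees
    refs: dict        # reference handle -> underlying record, kept by the runtime


def capture() -> list:
    # Stage 1: stand-in for an API call or database query.
    return [
        {"user_id": 4421, "name": "John", "email": "john@example.com"},
        {"user_id": 4422, "name": "John", "email": "j2@example.com"},
    ]


def normalise(rows: list) -> list:
    # Stage 2: deterministic, pure function; redact e-mail addresses.
    return [{**r, "email": EMAIL.sub("[redacted]", r["email"])} for r in rows]


def encode(rows: list) -> str:
    # Stage 3: format for the model.
    return json.dumps(rows, indent=2)


def reference(rows: list) -> dict:
    # Stage 4: stable handles the agent must use when it acts.
    return {f"user_id={r['user_id']}": r for r in rows}


def perceive() -> Observation:
    raw = capture()
    return Observation(prompt_text=encode(normalise(raw)), refs=reference(raw))


obs = perceive()
print(obs.prompt_text)
print(sorted(obs.refs))
```

The model sees the redacted text, while the runtime keeps the handle table. Two users named John stay distinguishable because every action has to name a `user_id` handle.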

Variants and trade-offs

  • Text-only perception. Everything the agent sees is text. The simplest pipeline: scrape pages to markdown, dump API responses as JSON, transcribe audio. Predictable cost and latency, easy to log and diff, deterministic eval. Limits: loses spatial and visual structure, and fails on chart-heavy or visually rendered content.
  • Multimodal perception. Vision plus text plus structured data. Required for web/UI agents, document understanding, and real-world tasks. Higher cost per step and slower, and the eval has to grade against pixel-accurate ground truth. Grounding becomes its own subsystem (Set-of-Marks, accessibility trees, OCR overlays).

Other axes:

  • Snapshot vs streaming. A screenshot is a snapshot; a video stream is continuous. Agents almost always operate on snapshots because reasoning per-frame is cost-prohibitive; the runtime decides when to capture.
  • Raw vs preprocessed observations. Raw is honest (the model sees what’s there); preprocessed (extracted entities, normalised text) is cheaper and more reliable but assumes the preprocessor is correct. Most production systems do both: preprocessed by default, raw available on request.
  • Synchronous vs async perception. Sync waits for the capture before reasoning; async lets the agent reason on stale observations while a fresh capture is in flight. Async helps long loops but introduces correctness questions when the world changes between capture and action.
  • Self-perception (introspection). The agent’s own past actions and internal state count as observations too. A <scratchpad> block, a recent-actions log, or a memory readout all feed into perception. Most agents under-include this.
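One way to handle the async correctness question above is a staleness guard: the runtime refuses to execute an action grounded in a snapshot older than some budget and forces a recapture instead. The names `Snapshot`, `StaleObservation`, and `execute`, and the two-second budget in the usage lines, are illustrative assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Snapshot:
    payload: str
    captured_at: float  # time.monotonic() value recorded at capture


class StaleObservation(Exception):
    """Raised when an action is grounded in a snapshot that is too old."""


def execute(action: Callable[[str], str], snapshot: Snapshot,
            max_age_s: float, now: Optional[float] = None) -> str:
    """Run the action only if the snapshot it was grounded in is fresh enough."""
    # `now` is injectable so the guard can be tested without sleeping.
    current = time.monotonic() if now is None else now
    age = current - snapshot.captured_at
    if age > max_age_s:
        raise StaleObservation(f"snapshot is {age:.1f}s old, limit is {max_age_s:.1f}s")
    return action(snapshot.payload)


snap = Snapshot(payload="page-v1", captured_at=100.0)
print(execute(str.upper, snap, max_age_s=2.0, now=101.0))
```

The caller would catch `StaleObservation`, capture again, and let the agent re-reason over the fresh observation before acting.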

Perception pipelines in three real agents

  • WebVoyager-style web agent. Capture: Playwright screenshot + accessibility tree. Normalise: downsample to 1024 wide, redact obvious PII. Encode: image + numbered bounding-box overlay (Set-of-Marks). Reference: click(mark=N), type(mark=N, text=...). Grounding gets stable references for free.
  • Coding agent. Capture: read file, run command, search codebase. Normalise: line numbers added, large files truncated with markers. Encode: markdown code blocks with language hints. Reference: (file, line_start, line_end) tuples in every edit tool.
  • Voice assistant. Capture: 16 kHz audio buffer. Normalise: VAD trims silence, speaker diarisation tags turns. Encode: transcribed text plus a [speaker: user, confidence: 0.92] annotation. Reference: utterance IDs in the conversation log.

Same four-stage pipeline; the details that vary are exactly the ones that decide whether the agent works.
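For the coding-agent case, the encode and reference stages can be sketched in a few lines: number the lines before showing the file to the model, then accept edits only as a line range. The helper names `number_lines` and `apply_edit` are hypothetical, and the file name from the (file, line_start, line_end) tuple is left out to keep the sketch in memory.

```python
def number_lines(text: str) -> str:
    """Encode: prefix each line with its 1-based number so it can be referenced."""
    lines = text.splitlines()
    width = len(str(len(lines)))
    return "\n".join(f"{i:>{width}} | {line}" for i, line in enumerate(lines, start=1))


def apply_edit(text: str, line_start: int, line_end: int, replacement: str) -> str:
    """Reference: replace the inclusive 1-based range [line_start, line_end]."""
    lines = text.splitlines()
    if not (1 <= line_start <= line_end <= len(lines)):
        # Out-of-range references are rejected, not clamped.
        raise ValueError(f"invalid range {line_start}-{line_end} for {len(lines)} lines")
    lines[line_start - 1:line_end] = replacement.splitlines()
    # Preserve the original trailing newline, if there was one.
    trailing = "\n" if text.endswith("\n") else ""
    return "\n".join(lines) + trailing


source = "def add(a, b):\n    return a - b\n"
print(number_lines(source))
fixed = apply_edit(source, 2, 2, "    return a + b")
print(fixed)
```

Rejecting an invalid range matters more than it looks: a silently clamped edit is exactly the kind of coherent-but-wrong action that grounding is supposed to prevent.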

When this is asked in interviews

Perception comes up most in design-heavy AI loops, especially anything involving the web, documents, or real-world environments.

  • AI-product design loops. “Design a web agent” — perception is half the answer. What does the agent see (screenshot, DOM, both)? How does it refer to elements? What’s the latency budget per capture?
  • ML engineering loops. “How do you handle multimodal input?” — they want the pipeline (capture, normalise, encode, reference), not just “we send images to GPT-4o”.
  • Senior backend / platform loops. “Where does grounding live in your stack?” — typically in the tool-surface layer (tools take typed references, not free strings) and the perception pipeline (stable IDs assigned on capture).

Common follow-ups:

  • “What’s the failure mode when grounding goes wrong?” — silent: the agent does the wrong action coherently. Mitigation: verification tools (read-after-write), tight tool input schemas, human review on irreversible actions.
  • “How do you evaluate perception quality?” — separately from the agent. Hold out a labelled set of (screenshot, correct element ID) pairs and measure top-1 accuracy of the grounding pipeline before you blame the model.
  • “Why not just give the model raw coordinates?” — because pixel-precise output from LLMs is unreliable; discretising via Set-of-Marks turns regression into classification, which is where LLMs are strongest.
  • “How would you keep the perception layer cheap?” — capture lazily (only when a tool requires it), cache stable observations across steps, downsample images to the smallest size the model can still read, prefer structured input over rendered when both are available.
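The perception-quality follow-up above amounts to a small harness: run the grounding pipeline alone over a labelled hold-out set and report top-1 accuracy. In the sketch below the observations are text labels standing in for screenshots, and `toy_grounder` is a hypothetical lookup-table grounder; a real harness would call the actual pipeline.

```python
from typing import Callable, Iterable, Tuple


def top1_accuracy(examples: Iterable[Tuple[str, str]],
                  ground: Callable[[str], str]) -> float:
    """Fraction of (observation, expected element ID) pairs grounded correctly."""
    pairs = list(examples)
    if not pairs:
        raise ValueError("hold-out set is empty")
    hits = sum(1 for observation, expected in pairs if ground(observation) == expected)
    return hits / len(pairs)


# Hypothetical grounder: a lookup table that has never seen the "Email" field.
LABEL_TO_ID = {"Submit order": "button-submit-42", "Cancel": "button-cancel-7"}


def toy_grounder(observation: str) -> str:
    return LABEL_TO_ID.get(observation, "unknown")


held_out = [
    ("Submit order", "button-submit-42"),
    ("Cancel", "button-cancel-7"),
    ("Email", "input-email"),
]
accuracy = top1_accuracy(held_out, toy_grounder)
print(f"top-1 grounding accuracy: {accuracy:.2f}")
```

Because the grounder is passed in as a function, the same harness can compare grounding variants, such as Set-of-Marks against accessibility-tree handles, on identical examples.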