Perception and Grounding
How agents take input — text, vision, audio, structured data — and ground it to actions in their environment.
Summary
Perception is how an agent takes in the world: text, screenshots, audio, sensor streams, structured payloads. Grounding is the harder problem of mapping that perception onto the actions the agent can actually take — turning a pixel at position (x, y) into “click the submit button”, or turning a JSON blob into “the user wants order #4421 refunded”.
A model can be brilliant at reasoning and still fail catastrophically if perception is noisy or grounding is wrong. The agent doesn’t see the environment directly; it sees whatever the runtime feeds in. That makes the perception pipeline — what gets captured, how it’s encoded, how it’s referenced in the prompt — load-bearing in a way that’s easy to underestimate.
Why it matters
Three reasons perception-and-grounding is where most “real-world agent” projects go off the rails:
- Reasoning is only as good as observations. A web agent that gets a fuzzy screenshot and a flattened DOM blob will misclick. A diagnosis agent that gets a free-text summary instead of the raw labs will miss findings. Bad perception silently degrades every downstream step.
- Grounding errors are silent. If the agent picks the wrong object (clicks the wrong row, edits the wrong file, queries the wrong table), the rest of the trajectory looks coherent but is solving the wrong problem. End-state grading catches this only after the whole trajectory has run.
- Multimodality changes the system, not just the prompt. Adding vision isn’t a one-line config — it changes latency budgets, cost shape, prompt structure, eval harness. Treating it as “the same agent with screenshots” is how you end up with a 10x cost regression in production.
Done right, perception is a tight pipeline with clear contracts: the runtime knows what it captured, what format the model expects, and how the model’s outputs map back onto the world.
How it works
Modalities and their encodings
- Text. The native modality. Easy to include in the prompt, easy to log, easy to diff. The main work is structuring the text — JSON for API responses, markdown for human-facing content, line-numbered code for diffs.
- Vision. Screenshots, photos, charts, document scans. Sent to vision-capable models as base64-encoded data or as URLs. Two pitfalls: resolution (downsampling drops detail the agent then can’t see) and reference (the model can describe what’s in the image but can’t act on coordinates unless the tool surface accepts them).
- Audio. Speech and ambient sound. Usually transcribed to text via a speech model before being handed to the agent, because model support for reasoning directly over raw audio is still limited. Transcription loses prosody but gains debuggability.
- Structured data. Database rows, API payloads, telemetry. Almost always converted to a textual representation (JSON or a markdown table) before going into the prompt. The conversion is where ambiguity sneaks in: a date as `"2026-05-17"` vs `"May 17, 2026"` produces different reasoning.
- Sensor / continuous. IoT telemetry, video, robot proprioception. Usually summarised by a downstream model or rules engine into discrete events the agent can reason over: “motion detected at 14:02”, not 1080p frames.
Grounding mechanisms
Grounding is the bridge between what the model sees and what the agent does. Three common mechanisms:
- Set-of-Marks (SoM) for vision. Overlay numbered bounding boxes on the screenshot before sending it to the model. The model picks an action like `click(mark=7)` instead of `click(x=412, y=583)`. Translates a pixel-grounding problem into a discrete-choice problem the LLM handles better. WebVoyager-style web agents lean on this heavily.
- DOM / accessibility-tree handles. For browser agents, supplement the screenshot with a structured tree of interactive elements, each with a stable ID. The agent emits `click(id=button-submit-42)`. Robust to layout changes; brittle to dynamically rendered IDs.
- Entity references in text/structured input. Don’t ask the agent to refer to “the user named John from earlier”; assign IDs (`user_id=4421`) on ingestion and require the agent to use them. Eliminates the “which one did you mean” failure mode that string-matching introduces.
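As a rough illustration of the Set-of-Marks idea, the sketch below shows how a runtime might translate a `click(mark=N)` action back into pixel coordinates. The `Element` shape and the `assign_marks` and `resolve_click` helpers are hypothetical names invented for this example, not part of any particular agent framework; detecting the elements and drawing the overlay are assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """An interactive element found in a screenshot (hypothetical shape)."""
    label: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def assign_marks(elements: list) -> dict:
    """Give every element a numbered mark, starting at 1."""
    return {mark: el for mark, el in enumerate(elements, start=1)}


def resolve_click(marks: dict, mark: int) -> tuple:
    """Translate the model's click(mark=N) into the centre of that box."""
    if mark not in marks:
        # Reject unknown marks instead of guessing a nearby element.
        raise KeyError(f"unknown mark {mark}; valid marks: {sorted(marks)}")
    el = marks[mark]
    return (el.x + el.width // 2, el.y + el.height // 2)


marks = assign_marks([
    Element("Cancel", x=300, y=560, width=100, height=40),
    Element("Submit", x=420, y=560, width=100, height=40),
])
print(resolve_click(marks, 2))  # (470, 580)
```

The model only ever chooses among the marks it was shown; the runtime owns the mapping back to pixels, so an invalid choice fails loudly rather than clicking somewhere plausible.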
The perception pipeline
A robust perception pipeline has four stages:
- Capture. Take a screenshot, scrape a page, pull a row, transcribe audio. Time-bound; missing or stale captures are a real failure mode.
- Normalise. Resize images, sanitise HTML, strip PII, redact secrets, convert timestamps. Deterministic, ideally pure-function.
- Encode. Format for the model — JSON, markdown table, base64 image, line-numbered code block. This is where the format-vs-content trade-off lives.
- Reference. Decide how the agent will refer back to objects in the observation when it acts — Set-of-Marks indices, IDs, line numbers, JSON pointers.
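Because captures are time-bound, one simple guard is to stamp each observation at capture time and refuse to reuse it once it is too old. The following is a minimal sketch under that assumption; `TimedObservation`, `is_fresh`, `observe`, and the `max_age_s` threshold are illustrative names, and the right threshold depends on how fast the environment changes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TimedObservation:
    payload: str
    captured_at: float  # seconds on the time.monotonic() clock


def is_fresh(obs: TimedObservation, max_age_s: float,
             now: Optional[float] = None) -> bool:
    """True if the observation is at most max_age_s seconds old."""
    now = time.monotonic() if now is None else now
    return now - obs.captured_at <= max_age_s


def observe(cached: Optional[TimedObservation],
            capture: Callable[[], TimedObservation],
            max_age_s: float,
            now: Optional[float] = None) -> TimedObservation:
    """Reuse the cached observation while it is fresh; otherwise re-capture."""
    if cached is not None and is_fresh(cached, max_age_s, now):
        return cached
    return capture()


# Usage: "now" is passed explicitly here only to make the example deterministic.
old = TimedObservation("old screenshot", captured_at=0.0)
recapture = lambda: TimedObservation("new screenshot", captured_at=10.0)
print(observe(old, recapture, max_age_s=2.0, now=1.0).payload)   # old screenshot
print(observe(old, recapture, max_age_s=2.0, now=10.0).payload)  # new screenshot
```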
The contract between the Encode and Reference stages is what makes grounding work. If the encoding doesn’t expose stable references, the agent has no choice but to free-text its way back to the world, and free-text grounding is exactly where it goes wrong.
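To make that contract concrete, here is a small sketch of the four stages over structured data. Everything in it is an assumption made for illustration: the order rows, the `order:<id>` handle format, and the `capture`, `normalise`, `encode_with_references`, and `ground` helpers are hypothetical. The point it tries to show is that every object in the prompt carries the handle the agent must use when it acts.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class Observation:
    prompt_text: str   # what the model sees
    references: dict   # handle -> the underlying object


def capture() -> list:
    # Stand-in for a real capture (database query, API call, scrape).
    return [
        {"id": 4421, "customer": " Ada ", "placed_on": date(2026, 5, 17)},
        {"id": 4422, "customer": "Grace", "placed_on": date(2026, 5, 18)},
    ]


def normalise(rows: list) -> list:
    # Deterministic clean-up: trim whitespace, dates to ISO 8601 strings.
    return [
        {
            "id": row["id"],
            "customer": row["customer"].strip(),
            "placed_on": row["placed_on"].isoformat(),
        }
        for row in rows
    ]


def encode_with_references(rows: list) -> Observation:
    # Encode and reference together: each line starts with its handle.
    references = {f"order:{row['id']}": row for row in rows}
    lines = [
        f"[{handle}] {json.dumps(row, sort_keys=True)}"
        for handle, row in references.items()
    ]
    return Observation("\n".join(lines), references)


def ground(obs: Observation, handle: str) -> dict:
    # Map a handle emitted by the agent back onto the object it names.
    if handle not in obs.references:
        raise KeyError(f"unknown handle {handle!r}")
    return obs.references[handle]


obs = encode_with_references(normalise(capture()))
print(obs.prompt_text)
print(ground(obs, "order:4421")["customer"])  # Ada
```

A tool that accepts only handles from `obs.references` cannot be pointed at an object the agent was never shown.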
Variants and trade-offs
Perception designs also vary along these axes:
- Snapshot vs streaming. A screenshot is a snapshot; a video stream is continuous. Agents almost always operate on snapshots because reasoning per-frame is cost-prohibitive; the runtime decides when to capture.
- Raw vs preprocessed observations. Raw is honest (the model sees what’s there); preprocessed (extracted entities, normalised text) is cheaper and more reliable but assumes the preprocessor is correct. Most production systems do both: preprocessed by default, raw available on request.
- Synchronous vs async perception. Sync waits for the capture before reasoning; async lets the agent reason on stale observations while a fresh capture is in flight. Async helps long loops but introduces correctness questions when the world changes between capture and action.
- Self-perception (introspection). The agent’s own past actions and internal state count as observations too. A `<scratchpad>` block, a recent-actions log, or a memory readout all feed into perception. Most agents under-include this.
Perception pipelines in three real agents
- WebVoyager-style web agent. Capture: Playwright screenshot + accessibility tree. Normalise: downsample to 1024 px wide, redact obvious PII. Encode: image + numbered bounding-box overlay (Set-of-Marks). Reference: `click(mark=N)`, `type(mark=N, text=...)`. Grounding gets stable references for free.
- Coding agent. Capture: read file, run command, search codebase. Normalise: line numbers added, large files truncated with markers. Encode: markdown code blocks with language hints. Reference: `(file, line_start, line_end)` tuples in every edit tool.
- Voice assistant. Capture: 16 kHz audio buffer. Normalise: VAD trims silence, speaker diarisation tags turns. Encode: transcribed text plus a `[speaker: user, confidence: 0.92]` annotation. Reference: utterance IDs in the conversation log.
Same four-stage pipeline; the details that vary are exactly the ones that decide whether the agent works.
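For the coding-agent case, the normalise and reference steps might look roughly like the sketch below. `number_lines` and `apply_edit` are hypothetical helpers written for this example, not any specific tool's API; a real edit tool would also take the file path and handle trailing newlines, which this sketch ignores.

```python
def number_lines(source: str) -> str:
    """Prefix each line with its 1-based number, as shown to the model."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(source.splitlines(), start=1)
    )


def apply_edit(source: str, line_start: int, line_end: int,
               replacement: str) -> str:
    """Replace lines line_start..line_end (1-based, inclusive)."""
    lines = source.splitlines()
    if not (1 <= line_start <= line_end <= len(lines)):
        # An out-of-range reference is a grounding error; fail loudly.
        raise ValueError(
            f"invalid range {line_start}-{line_end} for {len(lines)} lines"
        )
    lines[line_start - 1:line_end] = replacement.splitlines()
    return "\n".join(lines)


source = "a = 1\nb = 2\nc = 3"
print(number_lines(source))
print(apply_edit(source, 2, 2, "b = 20"))
```

Because the edit is addressed by line range rather than by quoting the old text, the runtime can validate the reference before anything is written.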
When this is asked in interviews
Perception comes up most in design-heavy AI loops, especially anything involving the web, documents, or real-world environments.
- AI-product design loops. “Design a web agent” — perception is half the answer. What does the agent see (screenshot, DOM, both)? How does it refer to elements? What’s the latency budget per capture?
- ML engineering loops. “How do you handle multimodal input?” — they want the pipeline (capture, normalise, encode, reference), not just “we send images to GPT-4o”.
- Senior backend / platform loops. “Where does grounding live in your stack?” — typically in the tool-surface layer (tools take typed references, not free strings) and the perception pipeline (stable IDs assigned on capture).
Common follow-ups:
- “What’s the failure mode when grounding goes wrong?” — silent: the agent does the wrong action coherently. Mitigation: verification tools (read-after-write), tight tool input schemas, human review on irreversible actions.
- “How do you evaluate perception quality?” — separately from the agent. Hold out a labelled set of (screenshot, correct element ID) pairs and measure top-1 accuracy of the grounding pipeline before you blame the model.
- “Why not just give the model raw coordinates?” — because pixel-precise output from LLMs is unreliable; discretising via Set-of-Marks turns regression into classification, which is where LLMs are strongest.
- “How would you keep the perception layer cheap?” — capture lazily (only when a tool requires it), cache stable observations across steps, downsample images to the smallest size the model can still read, prefer structured input over rendered when both are available.
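The held-out evaluation mentioned above can be as small as the sketch below: a top-1 accuracy measure over labelled (observation, correct element ID) pairs. The labelled set and the lookup-table predictor are toy stand-ins invented for illustration; in practice `predict` would wrap the real grounding pipeline.

```python
from typing import Callable, Sequence, Tuple


def top1_accuracy(labelled: Sequence[Tuple[str, str]],
                  predict: Callable[[str], str]) -> float:
    """Fraction of observations whose predicted element ID is correct."""
    if not labelled:
        raise ValueError("labelled set is empty")
    correct = sum(1 for obs, expected in labelled if predict(obs) == expected)
    return correct / len(labelled)


# Toy labelled set: (observation name, correct element ID).
labelled = [
    ("checkout page", "button-submit-42"),
    ("login page", "button-login-7"),
    ("search page", "input-search-3"),
    ("cart page", "button-remove-9"),
]

# Toy predictor that grounds one of the four observations wrongly.
toy_predictions = {
    "checkout page": "button-submit-42",
    "login page": "button-login-7",
    "search page": "input-search-3",
    "cart page": "button-checkout-1",
}

print(top1_accuracy(labelled, toy_predictions.get))  # 0.75
```

Tracking this number separately from end-to-end task success makes it easier to tell a grounding regression from a reasoning one.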
Related concepts