Action and Tool Use — Agentic · Engineering Playbook

Summary

The action surface is the set of effects an agent can produce in the world. In LLM-based agents, actions are almost always mediated by tools — typed functions the model can call. A tool has a name, a description, an input schema, and an implementation the runtime executes. The model never touches the outside world directly; it emits structured tool calls, and a runtime decides whether and how to execute them.

This indirection is the main reason agentic systems can be made safe and debuggable. The model proposes; the runtime disposes. Every action goes through a single chokepoint (schema validation, permission checks, logging, rate limits, retries) before it touches anything real. Designing that chokepoint well is the difference between an agent you can deploy and one you can only demo.

Why it matters

Three reasons action design is where most production-readiness work happens:

Actions have blast radius. A bad answer wastes tokens; a bad action sends an email, drops a table, or charges a card. The cost of being wrong is asymmetric, and the architecture has to match that asymmetry.
Tool design is the highest-leverage prompt-engineering you’ll do. Renaming a tool, tightening its schema, or adding a one-line description can move success rate more than a model upgrade. The model’s behaviour is shaped at least as much by the tool surface as by the system prompt.
Code execution shifts the safety frontier. The moment an agent can run arbitrary code, the threat model changes. Sandboxing, network egress controls, resource limits, and audit logs become non-optional.

The “Action” half of the perception-reason-act loop is where the agent commits to a state of the world. Until then, all reasoning is hypothetical.

How it works

The anatomy of a tool

A tool has four parts the runtime cares about:

Name. A short, model-readable identifier — send_email, query_orders, run_python. Names are part of the prompt; the model picks tools by name, so collisions and ambiguity hurt.
Description. A natural-language docstring explaining when to use the tool, what it returns, and any constraints. It is often the most under-engineered field in an agent codebase, even though a precise description can noticeably change how often, and how correctly, the model calls the tool.
Input schema. Usually JSON Schema, declaring required/optional fields, types, enums, and constraints. The model emits arguments that conform; the runtime validates before executing.
Implementation. The runtime-side function that actually does the work — calls an API, queries a DB, runs a shell command. Returns a result the runtime serialises into the next observation.

The contract between model and runtime is the schema. Everything else — retries, validation errors, partial results — flows back to the model as a structured observation.
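As a rough illustration of the four parts, here is a minimal sketch in Python. The `Tool` dataclass, the `to_model_spec` helper, and the stand-in `query_orders` implementation are hypothetical names chosen for this example; real tool-calling APIs use their own field names for the same ideas.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    """The four parts of a tool the runtime cares about."""
    name: str                            # model-readable identifier
    description: str                     # when to use it, what it returns, constraints
    input_schema: dict[str, Any]         # JSON Schema for the arguments
    implementation: Callable[..., Any]   # runtime-side function that does the work


def query_orders(customer_id: str, limit: int = 10) -> list[dict[str, Any]]:
    # Hypothetical stand-in for a real database query.
    orders = [{"order_id": f"ord_{i}", "customer_id": customer_id} for i in range(3)]
    return orders[:limit]


QUERY_ORDERS = Tool(
    name="query_orders",
    description=(
        "Look up orders for one customer. Use when the user asks about "
        "order status or history. Returns a list of order records. Read-only."
    ),
    input_schema={
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50},
        },
        "required": ["customer_id"],
        "additionalProperties": False,
    },
    implementation=query_orders,
)


def to_model_spec(tool: Tool) -> dict[str, Any]:
    """What the model sees: everything except the implementation."""
    return {
        "name": tool.name,
        "description": tool.description,
        "input_schema": tool.input_schema,
    }
```

Keeping the implementation out of the model-facing spec mirrors the split described above: the name, description, and schema are effectively part of the prompt, while the implementation stays on the runtime side.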

The action lifecycle

A single tool call goes through six stages:

Proposal. The model emits a tool call: {tool: "send_email", args: {to: ..., subject: ..., body: ...}}. May be one of several parallel calls in the same response.
Validation. The runtime checks the schema. Missing fields, type mismatches, or constraint violations short-circuit before any side effect.
Authorisation. Permission check: is this agent, in this context, allowed to call this tool with these arguments? The default is read-only; write tools are gated by allowlists or human confirmation.
Execution. The runtime invokes the implementation. Timeouts, retries, rate limits, circuit breakers all live here.
Result encoding. The implementation returns a value; the runtime serialises it into an observation the model can parse — usually JSON or markdown, possibly truncated if huge.
Logging. Every call (proposal, args, result, latency, cost) is logged. Without this, agentic incidents are impossible to triage.

The model only sees the outputs of stages 1 and 5: its own proposal and the encoded result. Everything the runtime does in stages 2-4 and 6 is invisible to it, and that is where most of the operational value of the agent platform lives.
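The stages after the proposal can be sketched as a single function. This is a minimal illustration under stated assumptions, not a reference implementation: the `TOOLS` registry, `WRITE_ALLOWLIST`, `CALL_LOG`, and `handle_tool_call` are hypothetical names, the validation is a hand-rolled type check standing in for real JSON Schema validation, and timeouts and retries are left out.

```python
import json
import time
from typing import Any

# Hypothetical registry: required argument types, write flag, implementation.
TOOLS: dict[str, dict[str, Any]] = {
    "query_orders": {
        "required": {"customer_id": str},
        "write": False,
        "impl": lambda customer_id: [{"order_id": "ord_1", "customer_id": customer_id}],
    },
    "send_email": {
        "required": {"to": str, "subject": str, "body": str},
        "write": True,
        "impl": lambda to, subject, body: {"queued": True},
    },
}
WRITE_ALLOWLIST: set[str] = set()   # no write tools are allowed in this sketch
CALL_LOG: list[dict[str, Any]] = []
MAX_RESULT_CHARS = 2000


def handle_tool_call(call: dict[str, Any]) -> str:
    """Run one proposed call through stages 2-6 and return the observation."""
    started = time.monotonic()
    name, args = call.get("tool"), call.get("args", {})
    spec = TOOLS.get(name)

    # Stage 2: validation. Short-circuit before any side effect.
    if spec is None:
        result = {"ok": False, "error_code": "unknown_tool", "message": str(name)}
    elif (
        not isinstance(args, dict)
        or set(args) != set(spec["required"])
        or any(not isinstance(args[k], t) for k, t in spec["required"].items())
    ):
        result = {"ok": False, "error_code": "invalid_args", "message": "schema mismatch"}
    # Stage 3: authorisation. Reads pass; writes need an allowlist entry.
    elif spec["write"] and name not in WRITE_ALLOWLIST:
        result = {"ok": False, "error_code": "not_authorised", "message": str(name)}
    else:
        # Stage 4: execution (timeouts, retries, rate limits omitted here).
        try:
            result = {"ok": True, "result": spec["impl"](**args)}
        except Exception as exc:
            result = {"ok": False, "error_code": "tool_error", "message": str(exc)}

    # Stage 5: result encoding. Naive character truncation; a real encoder
    # would trim the payload so the observation stays valid JSON.
    observation = json.dumps(result)[:MAX_RESULT_CHARS]

    # Stage 6: logging. Every call is recorded, including rejected ones.
    CALL_LOG.append({
        "tool": name,
        "args": args,
        "ok": result["ok"],
        "latency_s": time.monotonic() - started,
    })
    return observation
```

Rejected calls come back to the model as ordinary observations, so a validation or permission failure becomes something the model can read and react to rather than a crash in the loop.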

Modern tool-calling APIs

Function calling has standardised across providers around the same shape: register a list of tools, the model emits zero-or-more tool calls per response, the runtime executes and replies with tool results. Three features worth using:

Parallel tool calls. The model can emit multiple independent calls in one response. The runtime executes them concurrently; total wall-clock latency is roughly that of the slowest call, not the sum.
Strict / structured output. The model is constrained to emit only JSON that conforms to the schema. This eliminates parse failures, though strict modes may support only a subset of JSON Schema, which can reduce flexibility on complex schemas.
Streaming tool calls. Arguments stream token-by-token. Useful for long arguments (large code blocks, long emails); the runtime can start work before the full call completes, though most workflows wait for completion.
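To make the latency claim for parallel tool calls concrete, here is a small sketch using Python's `asyncio`. The `fake_tool`, `run_parallel`, and `timed_run` functions are hypothetical; the sleep stands in for real I/O, and the calls are assumed to be independent of each other.

```python
import asyncio
import time
from typing import Any


async def fake_tool(name: str, delay_s: float) -> dict[str, Any]:
    # Hypothetical tool: sleeps to simulate I/O latency.
    await asyncio.sleep(delay_s)
    return {"tool": name, "ok": True}


async def run_parallel(calls: list[tuple[str, float]]) -> list[Any]:
    """Execute independent tool calls concurrently, preserving call order."""
    tasks = [fake_tool(name, delay) for name, delay in calls]
    # return_exceptions=True returns failures as values instead of raising
    # on the first one, so every call still yields a result slot.
    return await asyncio.gather(*tasks, return_exceptions=True)


def timed_run(calls: list[tuple[str, float]]) -> tuple[list[Any], float]:
    """Run the calls and report wall-clock time alongside the results."""
    started = time.monotonic()
    results = asyncio.run(run_parallel(calls))
    return results, time.monotonic() - started
```

With three calls that each take about 0.3 seconds, `timed_run` should finish in roughly 0.3 seconds rather than 0.9, and the results come back in the same order the calls were proposed, which makes it straightforward to match each result to its call.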

Code execution as a tool

A special case worth calling out. A run_python (or run_bash, run_javascript) tool lets the model write and execute arbitrary code in a sandbox. Three things change:

Capability. The model can do anything the sandbox allows — data analysis, file manipulation, web scraping, math. Far broader than a fixed tool set.
Threat model. Arbitrary code is arbitrary attack surface. The sandbox must restrict filesystem, network, syscalls, CPU/memory. Treat it like an untrusted user submission.
Debuggability. Code is reviewable; the runtime can log every script and its output. Often easier to audit than a long chain of typed tool calls.

Most production agents combine both: typed tools for high-frequency, well-defined actions (send email, query orders) and code execution for the long tail of one-off analysis. This is close in spirit to the MRKL pattern of routing work to specialised modules, with code execution acting as one general-purpose module.
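The shape of a `run_python` tool can be sketched with a child process. This is deliberately not a real sandbox: a subprocess still shares the host's filesystem and network, so the filesystem, network, and syscall restrictions described above would have to come from container- or VM-level isolation. The function signature, the environment allowlist, and the limits below are assumptions made for illustration.

```python
import os
import subprocess
import sys
import tempfile
from typing import Any

# Hypothetical allowlist: only these environment variables reach the child,
# so secrets in the parent environment are not inherited.
ENV_ALLOWLIST = ("PATH", "SYSTEMROOT", "LD_LIBRARY_PATH")


def run_python(code: str, timeout_s: float = 5.0, max_output_chars: int = 4000) -> dict[str, Any]:
    """Run model-written code in a child interpreter; return a structured result.

    NOT a sandbox on its own: the child can still reach the host filesystem
    and network. It only demonstrates the timeout, output cap, and logging shape.
    """
    env = {k: v for k, v in os.environ.items() if k in ENV_ALLOWLIST}
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I: isolated mode, ignores PYTHON* variables and the user site dir.
                [sys.executable, "-I", "-c", code],
                cwd=workdir,           # scratch directory, deleted afterwards
                env=env,
                capture_output=True,
                text=True,
                timeout=timeout_s,     # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "error_code": "timeout", "message": f"exceeded {timeout_s}s"}
    return {
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        "stdout": proc.stdout[:max_output_chars],   # cap what flows back to the model
        "stderr": proc.stderr[:max_output_chars],
    }
```

Because the whole script and its output pass through one function, logging every script and result for audit is a small addition at this point.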

Variants and trade-offs

Typed function calls — fixed catalogue of pre-registered tools with strict schemas. Easy to audit, gate, and rate-limit per tool. Predictable cost. Limited by what the developer pre-registered; novel tasks need new tools.

Code execution — the model writes and runs code in a sandbox. Handles the long tail of unforeseen tasks. Costs: harder to gate per “action” (one script can do many things), larger attack surface, more complex sandboxing requirements.

Other axes:

Read-only vs write tools. Read tools (search, query, fetch) are usually safe to expose broadly with cheap permission checks. Write tools (send, create, delete, charge) need explicit allowlists, possibly human confirmation, and stricter logging.
Idempotent vs non-idempotent. Idempotent tools (overwrite a file with given content, upsert a row by key) are safe to retry. Non-idempotent tools (send an email, charge a card) need effectively exactly-once semantics, usually built from retries plus deduplication: request IDs, dedupe windows, transaction tokens.
Sync vs async actions. Sync tools return a result before the next model call. Async tools (kick off a job, send a webhook) return a handle; the agent has to check status later. Most agents under-design the async case; the right pattern is a wait_for(job_id) or status-polling tool.
Deterministic vs stochastic tools. Most tools are deterministic. Some — generate_image, run_search, summarise — are themselves probabilistic. The agent has to handle “I called this twice and got different results”; the runtime can stabilise by caching results within a session.
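For the non-idempotent case, a request ID plus a dedupe window can be sketched as follows. `DedupingExecutor` is a hypothetical in-memory, single-process illustration; it assumes the caller reuses the same request ID on retry, and a real system would also pass that ID to the downstream API so a crash between the side effect and the bookkeeping cannot cause a duplicate.

```python
import time
from typing import Any, Callable


class DedupingExecutor:
    """Replays the stored result when a request ID repeats within the window."""

    def __init__(self, window_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window_s = window_s
        self._clock = clock
        self._seen: dict[str, tuple[float, Any]] = {}

    def execute(self, request_id: str, action: Callable[[], Any]) -> Any:
        now = self._clock()
        cached = self._seen.get(request_id)
        if cached is not None and now - cached[0] < self._window_s:
            return cached[1]      # retry: return the first result, no new side effect
        result = action()         # first attempt: perform the side effect once
        self._seen[request_id] = (now, result)
        return result


# Usage: a retried charge with the same request ID runs the action only once.
charges: list[int] = []


def charge_card() -> dict[str, str]:
    charges.append(4200)          # hypothetical side effect
    return {"charge_id": f"ch_{len(charges)}"}


executor = DedupingExecutor()
first = executor.execute("req-123", charge_card)
retry = executor.execute("req-123", charge_card)
```

After these two calls, `first` and `retry` are the same result and `charges` holds a single entry; a call with a different request ID would run the action again.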

A few tool-design patterns worth borrowing

Verification tools. Pair every write with a read. update_record is followed by read_record in the next step to confirm. Catches silent-failure modes.
Small tools that do one thing. Resist “kitchen sink” tools that take a mode argument and dispatch to many code paths. The model handles many small tools better than a few big ones.
Default-deny side effects. A tool that might have side effects should require an explicit confirm=true argument. The schema forces the model to commit deliberately; the description explains when to use it.
Dry-run modes. For high-blast-radius actions, expose a dry_run flag the model can use to validate inputs without producing effects. Cheap, hugely improves debuggability.
Structured errors, not exceptions. Tool failures should return {ok: false, error_code: ..., message: ...} instead of throwing. The model can read structured errors and recover; raw exceptions confuse the loop.
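Several of these patterns fit in one small write tool. The sketch below is a hypothetical `delete_record` with a default-deny `confirm` flag, a `dry_run` mode, and structured errors, paired with a `read_record` verification tool; the in-memory `RECORDS` store and the specific error codes are assumptions for illustration.

```python
from typing import Any

# Hypothetical in-memory store standing in for a real database.
RECORDS: dict[str, dict[str, Any]] = {"r1": {"status": "open"}}


def delete_record(record_id: str, confirm: bool = False,
                  dry_run: bool = False) -> dict[str, Any]:
    """Delete a record. Returns a structured result instead of raising."""
    if record_id not in RECORDS:
        return {"ok": False, "error_code": "not_found",
                "message": f"no record {record_id!r}"}
    if dry_run:
        # Validate inputs and report what would happen, with no effect.
        return {"ok": True, "dry_run": True, "would_delete": record_id}
    if not confirm:
        # Default-deny: the side effect needs an explicit confirm=true.
        return {"ok": False, "error_code": "confirmation_required",
                "message": "call again with confirm=true to delete"}
    del RECORDS[record_id]
    return {"ok": True, "deleted": record_id}


def read_record(record_id: str) -> dict[str, Any]:
    """Verification tool: the read that is paired with the write."""
    if record_id not in RECORDS:
        return {"ok": False, "error_code": "not_found",
                "message": f"no record {record_id!r}"}
    return {"ok": True, "record": RECORDS[record_id]}
```

A typical sequence would be a dry run, then a confirmed delete, then a `read_record` call that is expected to come back `not_found`. Each step returns the same `ok` / `error_code` / `message` shape, so the model can branch on the result without parsing free-form error text.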

When this is asked in interviews

Action design comes up in most senior AI interview loops, often as a follow-up to “design an agent”.

AI-product loops. “What tools does your agent have, and why those?” — the expected answer covers the catalogue, the read/write split, and the permission model.
Backend / platform loops. “How would you build the tool-execution layer?” — they want to see the validation-auth-execute-log pipeline, plus retries, timeouts, rate limits, and idempotency considerations.
Security / staff loops. “What’s the attack surface of your agent?” — prompt injection that triggers a write tool, exfiltration via a search tool, resource exhaustion via code execution. The candidate who lists all three is the one who’s shipped.

Common follow-ups:

“How do you handle a non-idempotent action being retried?” — request IDs at the API layer, dedupe windows in the runtime, explicit transaction tokens for things like payments.
“Code execution or typed tools?” — both, and you should be able to articulate where each wins. Typed for high-frequency, code for long-tail. Same agent, different tool surfaces.
“How do you gate destructive actions?” — allowlists per environment (no delete in prod from a chat agent), human-in-the-loop confirmation on irreversible writes, dry-run defaults, post-hoc audit logs.
“What’s the worst incident you’ve seen?” — strong answers describe a real failure (wrong row updated, mass email sent, runaway loop) and the systemic fix (schema tightening, allowlist, exit-condition tightening), not just “we added a confirmation step”.

The strongest signal you can give: don’t talk about tools as if they’re an LLM concern. They’re a platform concern. The model is one consumer of the tool layer; the same layer should support replay, dry runs, eval harnesses, and (eventually) other agents.