Function Calling and Tool Use — Agentic

What it is

Function calling is the pattern by which a language model emits a structured, schema-typed request to invoke a named function in the host program, receives the function’s return value back as a message, and continues the conversation with that result in context. It is the verb of the agent: where ReAct is the loop shape, function calling is what fills the Action slot.

The pattern looks deceptively simple — “the model outputs JSON that names a function and its arguments” — but the discipline around the JSON is what makes it work. The model is constrained at decoding time (or at minimum heavily prompted) to produce output that matches a tool schema. The orchestrator parses that output, validates it against the schema, executes the underlying function, and feeds the result back. There is no free-form parsing, no regex over prose, no “did the model mean to call the API or just talk about it?” ambiguity.

Function calling is the single most consequential pattern shift between pre-2023 LLM apps and modern agents. Before it, every tool integration involved parsing the model’s natural-language output and hoping. After it, tool calls became a typed transport, with schemas, validators, retries, and parallelism — the same affordances any other RPC system has.

When to use it

Use function calling whenever the agent needs to do something in the world beyond producing text:

Reading external state. Fetch a record, query a database, list files, check a calendar, get the weather. These are the canonical examples — pure-read operations with well-defined inputs and outputs.
Mutating external state. Send an email, update a row, create a ticket, post a message. The schema becomes load-bearing because typos cost real money or break real systems.
Triggering computation. Run a SQL query, evaluate a Python snippet, kick off a job. The function abstracts the compute; the model decides when and how to invoke it.
Routing to a sub-system. A “function” can be the entry point to another agent, a retrieval pipeline, a vector search, or a non-LLM service. The model doesn’t need to know what’s inside — just the schema.

Don’t lean on function calling when:

There is no environment to call into. A pure-text classification or summarisation task doesn’t need tools. Asking the model to “call a function” to produce its answer is overhead.
The tool surface is too small. If there’s exactly one function and it’s always called once per request, function calling is heavier than a plain structured-output prompt.
The schema is unstable. Function-calling decoding is most reliable when the schema is fixed and well-documented; rapidly-changing schemas defeat the model’s training prior and increase malformed-call rates.

How it works

The lifecycle of a single tool call

A typical call flows through six steps:

Tool registration. The host program declares each tool with { name, description, input_schema } — usually JSON Schema. The descriptions and schemas are concatenated into the system prompt or sent through a dedicated tools parameter, depending on the API.
Decision. The model decides whether to call a tool or to respond directly. Modern APIs surface this in the response content: plain text, one or more tool_use blocks, or text followed by tool_use blocks (see parallel calls below).
Argument synthesis. The model emits the tool’s arguments as JSON conforming to input_schema. Constrained decoding (or strong fine-tuning) prevents most malformed JSON; what’s left to validate is schema conformance (types match, required fields present, enums respected) and whether the values make sense for the call.
Validation and dispatch. The orchestrator validates the JSON against the schema. On validation failure, it can either feed the error back to the model (so it can retry) or fail hard. On success, it dispatches the call to the registered handler.
Execution. The handler runs — could be a synchronous local function, an async HTTP request, a shell-out, an SQL query, anything. Errors here are wrapped and sent back as a tool result.
Result injection. The result is appended to the conversation as a tool_result message keyed to the original tool_use_id. The model sees it on its next turn and proceeds.
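The six steps above can be sketched in a few dozen lines. This is a minimal illustration, not any provider's SDK: the registry, the tiny hand-rolled validator (it checks only required fields, basic types, and enums), the get_weather tool, and the is_error flag on the result are all assumptions made for the example. The tool_use and tool_result shapes follow the field names used in the text.

```python
import json
from typing import Any, Callable

# Hypothetical in-memory registry: tool name -> (declaration, handler).
TOOLS: dict[str, tuple[dict, Callable[..., Any]]] = {}

def register(name: str, description: str, input_schema: dict,
             handler: Callable[..., Any]) -> None:
    """Step 1: declare a tool as {name, description, input_schema} plus a handler."""
    declaration = {"name": name, "description": description,
                   "input_schema": input_schema}
    TOOLS[name] = (declaration, handler)

_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate(args: dict, schema: dict) -> list[str]:
    """Step 4a: deliberately tiny check (required, type, enum only)."""
    errors = [f"missing required field: {f}"
              for f in schema.get("required", []) if f not in args]
    props = schema.get("properties", {})
    for key, value in args.items():
        spec = props.get(key)
        if spec is None:
            errors.append(f"unknown field: {key}")
            continue
        expected = _PY_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"{key}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: must be one of {spec['enum']}")
    return errors

def handle_tool_use(block: dict) -> dict:
    """Steps 4 to 6: validate, dispatch, and wrap the outcome as a tool_result."""
    def result(content: Any, is_error: bool = False) -> dict:
        return {"type": "tool_result", "tool_use_id": block["id"],
                "content": json.dumps(content), "is_error": is_error}

    if block["name"] not in TOOLS:
        return result({"error": f"unknown tool: {block['name']}"}, True)
    declaration, handler = TOOLS[block["name"]]
    errors = validate(block["input"], declaration["input_schema"])
    if errors:
        # Fed back to the model so it can retry with corrected arguments.
        return result({"error": "validation failed", "details": errors}, True)
    try:
        return result(handler(**block["input"]))
    except Exception as exc:  # Step 5: handler errors are wrapped, not raised.
        return result({"error": f"{type(exc).__name__}: {exc}"}, True)

# Illustrative tool with a canned answer.
register(
    "get_weather",
    "Return a canned forecast for a city.",
    {"type": "object",
     "properties": {
         "city": {"type": "string", "description": "City name."},
         "unit": {"type": "string", "enum": ["c", "f"],
                  "description": "Temperature unit."}},
     "required": ["city"]},
    lambda city, unit="c": {"city": city, "temp": 21 if unit == "c" else 70,
                            "unit": unit},
)

ok = handle_tool_use({"type": "tool_use", "id": "toolu_1",
                      "name": "get_weather", "input": {"city": "Oslo"}})
bad = handle_tool_use({"type": "tool_use", "id": "toolu_2",
                       "name": "get_weather", "input": {"unit": "kelvin"}})
```

Here ok carries the handler's return value, while bad carries a validation error keyed to the same tool_use_id, so the model can see what it got wrong on its next turn. A production orchestrator would more likely use a full JSON Schema validator than the toy check shown here.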

Schemas as contracts

The tool schema is doing three jobs at once:

It tells the model what to emit. Field names, types, and constraints are training-prior anchors; well-named fields get well-typed values.
It tells the orchestrator what to validate. The same schema that constrains the model also validates the parsed output before dispatch.
It documents the API for human reviewers. When the agent’s tool calls show up in a log, the schema is the reference manual.

Good schemas are descriptive (every field has a description), conservative (no optional fields when required is meant), and bounded (enums beat free-form strings whenever the set is closed). Bad schemas are vague (args: object), excessively permissive (everything optional), or misleadingly named: a field whose name conflicts with the model’s training prior (a delete field that means “soft-delete”) will surprise the model half the time.
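To make the contrast concrete, here is a small sketch of a schema that follows those rules next to two that do not, plus a hypothetical lint_schema helper that flags the problems named above. The create-ticket fields and the helper itself are illustrative assumptions, not part of any library.

```python
# A descriptive, conservative, bounded schema (illustrative fields).
GOOD = {
    "type": "object",
    "properties": {
        "title": {"type": "string",
                  "description": "One-line summary of the issue."},
        "priority": {"type": "string", "enum": ["low", "medium", "high"],
                     "description": "Triage priority."},
    },
    "required": ["title", "priority"],
}

# Vague: the arguments are an untyped object.
VAGUE = {"type": "object"}

# Permissive: one opaque field, nothing required, nothing described.
PERMISSIVE = {"type": "object", "properties": {"args": {"type": "object"}}}

def lint_schema(schema: dict) -> list[str]:
    """Hypothetical helper: report schema smells before a tool is registered."""
    props = schema.get("properties")
    if not props:
        return ["no properties declared: arguments are an untyped object"]
    warnings = []
    if not schema.get("required"):
        warnings.append("no required fields: everything is optional")
    for name, spec in props.items():
        if not spec.get("description"):
            warnings.append(f"{name}: missing description")
    return warnings

good_report = lint_schema(GOOD)
vague_report = lint_schema(VAGUE)
permissive_report = lint_schema(PERMISSIVE)
```

Running a check like this at registration time is one way to keep schema quality from drifting; it cannot catch misleading names, which still need human review.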

Parallel tool calls

Modern APIs allow the model to emit multiple tool_use blocks in a single response when the calls are independent. The orchestrator dispatches them concurrently and returns all results together on the next turn. This is the single biggest latency win available to ReAct-shaped agents:

A research agent that needs three searches issues all three in one turn instead of spreading them over three turns.
A coding agent that needs to read five files reads them in parallel.
A travel agent that checks flights, hotels, and weather batches them.

The model has to know the calls are independent — that’s a reasoning step. Models trained on parallel-call traces do this well; older models sometimes serialise unnecessarily. Prompt scaffolding can help: “If multiple lookups are independent, issue them in parallel.”

The orchestrator has to handle partial failure: if one of three parallel calls errors, the other two results still flow back, and the model decides what to do with the partial picture.
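A minimal sketch of concurrent dispatch with partial failure, using asyncio.gather with return_exceptions=True so that one failing call does not cancel the others. The search coroutine is a stand-in for a real handler, and the is_error flag on the result is an assumption of this example.

```python
import asyncio

async def search(query: str) -> str:
    """Stand-in handler: pretend to hit a search backend."""
    await asyncio.sleep(0.01)
    if "fail" in query:
        raise RuntimeError("upstream search unavailable")
    return f"results for {query!r}"

async def dispatch_parallel(blocks: list[dict]) -> list[dict]:
    """Run every tool_use block concurrently and return one result per block."""
    coros = [search(**block["input"]) for block in blocks]
    # return_exceptions=True: failures come back as values, in call order.
    outcomes = await asyncio.gather(*coros, return_exceptions=True)
    results = []
    for block, outcome in zip(blocks, outcomes):
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": str(outcome),
            "is_error": isinstance(outcome, Exception),
        })
    return results

blocks = [
    {"type": "tool_use", "id": "t1", "name": "search", "input": {"query": "flights"}},
    {"type": "tool_use", "id": "t2", "name": "search", "input": {"query": "fail hotels"}},
    {"type": "tool_use", "id": "t3", "name": "search", "input": {"query": "weather"}},
]
results = asyncio.run(dispatch_parallel(blocks))
```

All three results go back on the next turn: two successes and one error, each keyed to its own tool_use_id, leaving the decision about the partial picture to the model.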

Structured outputs vs tool calls

These two features are siblings, often confused:

Tool call — the model says “invoke f with these args”; the host runs f and the conversation continues.
Structured output — the model produces JSON conforming to a schema, but it is the final output, not an invocation. The conversation ends or moves on with that JSON as the value.

Some APIs let a tool’s schema double as a structured-output schema by treating a “respond with this JSON” action as the terminal tool. Others have a dedicated response_format parameter. Either way, the underlying mechanism is the same: the output is steered to a schema, by constrained decoding where the API supports it.
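The terminal-tool idea can be sketched with a scripted stand-in for the model. Everything here is hypothetical (the run function, the scripted_model, the lookup_order and respond tool names); the point is only that an ordinary tool call is executed and fed back, while a call to the terminal tool is returned as the final value and never executed.

```python
from typing import Any, Callable

def run(model: Callable[[list], dict], handlers: dict[str, Callable[..., Any]],
        terminal_tool: str, max_turns: int = 5) -> dict:
    """Loop until the model calls the terminal tool; return that call's arguments."""
    transcript: list[tuple[dict, Any]] = []
    for _ in range(max_turns):
        call = model(transcript)
        if call["name"] == terminal_tool:
            # Structured output: the JSON is the answer, not an invocation.
            return call["input"]
        # Ordinary tool call: execute and continue the conversation.
        result = handlers[call["name"]](**call["input"])
        transcript.append((call, result))
    raise RuntimeError("model did not produce a terminal call")

def scripted_model(transcript: list) -> dict:
    """Stand-in for a real model: one lookup, then a terminal response."""
    if not transcript:
        return {"name": "lookup_order", "input": {"order_id": "A1"}}
    _, last_result = transcript[-1]
    return {"name": "respond",
            "input": {"status": last_result["status"], "confident": True}}

handlers = {"lookup_order": lambda order_id: {"order_id": order_id,
                                              "status": "shipped"}}
final = run(scripted_model, handlers, terminal_tool="respond")
```

With a dedicated response_format parameter the loop would not need the terminal check at all; the terminal-tool version is useful when the same turn may either act or answer.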

Tool choice modes

Most modern APIs expose a tool_choice parameter (or an equivalent) with at least these modes; the exact spelling varies by provider:

auto — the model decides whether to call a tool or respond. The default for most agents.
required / any — the model must call at least one tool. Useful for forcing structured behaviour at specific points in a flow.
{ type: 'tool', name: 'X' } — the model must call tool X. Useful for stage-gated flows where the next step is known.
none — no tool calls allowed; the model must respond in plain text.

The mode is a runtime knob, not a baked-in property. A planner agent might be invoked with required on the planning step and auto on the execution steps. A safety reviewer might be invoked with a fixed tool to enforce the output shape.
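As a sketch of the runtime-knob idea, a hypothetical helper can map each stage of a flow to a tool_choice value. The stage names and the helper are assumptions; the dictionary shapes mirror the modes listed above, and real providers spell some of them differently.

```python
from typing import Optional

def tool_choice_for(stage: str, forced_tool: Optional[str] = None) -> dict:
    """Pick a tool_choice value per stage (illustrative stage names)."""
    if stage == "plan":
        return {"type": "any"}      # must call at least one tool
    if stage == "execute":
        return {"type": "auto"}     # the model decides
    if stage == "review":
        if forced_tool is None:
            raise ValueError("review stage needs the name of the tool to force")
        return {"type": "tool", "name": forced_tool}  # fixed output shape
    if stage == "summarise":
        return {"type": "none"}     # plain text only
    raise ValueError(f"unknown stage: {stage}")

plan_choice = tool_choice_for("plan")
review_choice = tool_choice_for("review", forced_tool="submit_verdict")
```

The orchestrator would pass the returned value along with each request, so the same agent definition behaves differently at different points in the flow.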

Error handling and retries

Tool calls fail in characteristic ways:

Schema validation fails — the model emitted bad JSON or wrong types. Feed the error back; the model retries with the correction.
Handler raises an exception — the underlying service is down or returned an error. Wrap the exception as a tool result; the model decides whether to retry, fall back, or abort.
Tool times out — the orchestrator enforces a per-call timeout. Surface it as a tool result; the model can try a cheaper alternative.
Model emits nonsense args — the schema is too loose to catch it. Tighten the schema, add explicit constraints, or post-validate in the handler before executing the underlying action.

The pattern that works: every failure mode produces a tool result, never an orchestrator-side raise that interrupts the loop. The model is allowed to see what went wrong and to adapt. The orchestrator’s job is to make failures legible, not to recover on the model’s behalf.
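One way to sketch "every failure becomes a tool result" is a wrapper that runs the handler in a worker thread with a timeout and converts every outcome, good or bad, into the same result shape. The safe_call name, the three lookup handlers, and the is_error flag are assumptions for illustration.

```python
import concurrent.futures as cf
import time
from typing import Any, Callable

_POOL = cf.ThreadPoolExecutor(max_workers=4)

def safe_call(tool_use_id: str, handler: Callable[..., Any], args: dict,
              timeout_s: float = 1.0) -> dict:
    """Run a handler and always return a tool_result; never raise to the loop."""
    def as_result(content: str, is_error: bool) -> dict:
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": content, "is_error": is_error}

    future = _POOL.submit(handler, **args)
    try:
        return as_result(str(future.result(timeout=timeout_s)), False)
    except cf.TimeoutError:
        # Note: the worker thread is not killed; only the wait is abandoned.
        return as_result(f"timeout: no result within {timeout_s}s", True)
    except Exception as exc:
        # Handler exceptions and bad arguments both land here.
        return as_result(f"{type(exc).__name__}: {exc}", True)

def fast_lookup(key: str) -> str:
    return f"value for {key}"

def flaky_lookup(key: str) -> str:
    raise ConnectionError("service down")

def slow_lookup(key: str) -> str:
    time.sleep(0.2)
    return "late"

fine = safe_call("t1", fast_lookup, {"key": "a"})
down = safe_call("t2", flaky_lookup, {"key": "a"})
slow = safe_call("t3", slow_lookup, {"key": "a"}, timeout_s=0.05)
wrong = safe_call("t4", fast_lookup, {"nope": 1})
```

The error strings are written for the model to read: they say what failed and, for the timeout, how long the orchestrator waited, which is the legibility the paragraph above asks for.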

Variants

The base pattern — register schemas, decode against them, dispatch — has many surface variations:

JSON Schema tool calling. The dominant flavour. Tool inputs are JSON Schema documents; the model’s output is validated against them. Used by Claude, OpenAI, Gemini, and most providers.
Pydantic / type-derived tool calling. Frameworks like LangChain and the Anthropic SDK can derive JSON Schema from Pydantic models or Python type hints, letting developers define tools as ordinary functions with annotations. The schema is generated; the model still sees JSON Schema.
Code-as-tool. Instead of one function per tool, the model is given a Python interpreter and writes code that calls helper libraries. More expressive but harder to sandbox. Used by some coding agents and data-science agents.
MCP (Model Context Protocol). A wire protocol that decouples tool servers from tool clients. The model client speaks MCP to one or more tool servers, each exposing its own tools. The schema discipline is the same; the deployment shape is different.
Streaming tool calls. The model streams the tool_use block incrementally, as partial JSON fragments; some orchestrators kick off speculative execution before the full arguments arrive. Saves latency for latency-sensitive agents.
Plan-then-call. A two-step pattern: first the model emits a plan (a list of intended tool calls with rationales), then it executes each in sequence. The plan is a tool call against a planner-tool schema. Useful for long-running agents where mid-flight introspection matters.
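The type-derived variant can be illustrated with the standard library alone. The sketch below is not how LangChain or any SDK implements it; schema_from_function is a hypothetical helper that handles only str, int, float, and bool parameters, and search_docs is an example function.

```python
import inspect
from typing import Callable, get_type_hints

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def schema_from_function(fn: Callable) -> dict:
    """Derive a {name, description, input_schema} declaration from annotations."""
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        # Raises KeyError for unannotated or unsupported parameter types.
        properties[name] = {"type": _JSON_TYPES[hints[name]]}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the field is required
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {"type": "object", "properties": properties,
                         "required": required},
    }

def search_docs(query: str, limit: int = 5) -> str:
    """Search the documentation index and return the top matches."""
    return f"{limit} hits for {query}"

declaration = schema_from_function(search_docs)
```

The developer writes an ordinary annotated function; the model still receives JSON Schema. A real derivation would also need per-field descriptions, which this sketch omits.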

Why JSON Schema and not a typed language like Protocol Buffers?

JSON Schema won because it is what the model’s training corpus is full of. Models have seen orders of magnitude more JSON Schema examples than .proto files, and the format is human-writable in a single file with no toolchain. Protobuf or GraphQL would give you stronger guarantees, but they would also require the model to fluently emit a less-familiar syntax, which tends to hurt call success rates. The right level for tool schemas is the level where the model is most fluent.

Example systems

Function calling is universal in modern agents; nearly every system in this workbook uses it as the underlying call mechanism:

WebVoyager — tool surface is click, type, scroll, goto, screenshot, plus a respond terminal. The function-calling schema is what keeps the action space legible and bounded.
NVIDIA Eureka — the inner reward-generation step is a structured-output call: the model emits Python code conforming to a reward_function schema, which the host then evaluates in simulation.
ChainBuddy — multi-agent pipeline generation where each agent’s contribution is a typed call against a pipeline-schema tool. The schema is doing the structural work that natural-language hand-off cannot.
OpenClaw — calendar, mail, search, and reminders are each registered as tools; the model routes user intent into the right call.

When you read about agent frameworks (Google ADK, LangChain, AutoGen), the framework’s job is largely making function calling ergonomic: tool registries, schema derivation, error wrapping, parallel dispatch. The pattern is the same across all of them.

Trade-offs

Schema-typed tool calls. Strengths: robust against malformed output; trivially validated; parallel-callable; legible in logs. Costs: higher upfront cost (schemas to write); model must be trained on tool-use traces; schema quality is load-bearing.

Free-form text + regex parsing. Strengths: zero schema setup; works with any model; flexible. Costs: brittle against output drift; no parallelism; no validation; debugging is forensic. Only acceptable for prototypes or one-off tools.
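A short sketch of why the free-form route is brittle. Both parsers and the example strings are made up for illustration: the regex version fires on prose that merely mentions a call and misses a call phrased slightly differently, while the typed version either parses a well-formed payload or fails loudly.

```python
import json
import re
from typing import Optional

def parse_freeform(text: str) -> Optional[tuple[str, str]]:
    """Regex over prose: guesses at a call from wording."""
    match = re.search(r"call (\w+)\((.*?)\)", text)
    return (match.group(1), match.group(2)) if match else None

def parse_typed(payload: str) -> tuple[str, dict]:
    """Typed transport: a JSON object with explicit name and input."""
    call = json.loads(payload)
    return call["name"], call["input"]

# False positive: the model is only talking about the function.
mentioned = parse_freeform("I could call get_weather(Oslo) but I will not.")

# False negative: the wording drifted, so the call is missed.
drifted = parse_freeform("Calling get_weather with city Oslo")

# Unambiguous: the payload is a call or it is not valid at all.
typed = parse_typed('{"name": "get_weather", "input": {"city": "Oslo"}}')
```

The typed result also carries named arguments as a dictionary, which is what makes schema validation possible before dispatch.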

Other trade-offs worth flagging:

Few, well-described tools vs many, narrowly-scoped tools. As a rough rule of thumb, once the surface grows past about 20 tools, model performance starts to degrade because the model has to spend attention on tool selection. Either consolidate (one search tool with a mode argument beats search_docs, search_web, search_code) or partition hierarchically (a router agent picks a sub-agent, which has its own tool surface).
Synchronous vs asynchronous tools. Synchronous tools block the loop; long-running tools (data export, batch job) need an async pattern — the call returns a job ID, and the model polls or subscribes. Async tools complicate the orchestrator but unblock the agent.
Trust boundary at the tool. A tool that calls a third-party API on a writable scope is a security surface. Schema constraints in the tool catch some misuse; the safety layer (allowlists, rate limits, dry-runs) catches the rest. Don’t rely on the model to refuse a dangerous call — it sometimes will, often won’t.
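For the trust-boundary point, a minimal sketch of a safety layer that sits between validation and dispatch. The guard function, the send_email tool, and the allowlisted domain are hypothetical; the idea is only that mutating calls pass an allowlist and can be previewed in a dry run, regardless of what the model asked for.

```python
# Illustrative policy: which tools mutate state, and where mail may go.
MUTATING_TOOLS = {"send_email"}
ALLOWED_RECIPIENT_DOMAINS = {"example.com"}

def guard(call: dict, dry_run: bool = False) -> dict:
    """Decide what the orchestrator does with an already-validated call."""
    name, args = call["name"], call["input"]
    if name not in MUTATING_TOOLS:
        return {"action": "execute"}          # read-only tools pass through
    domain = args["to"].rsplit("@", 1)[-1].lower()
    if domain not in ALLOWED_RECIPIENT_DOMAINS:
        return {"action": "deny",
                "reason": f"recipient domain {domain!r} is not allowlisted"}
    if dry_run:
        return {"action": "preview", "reason": "dry run: no email sent"}
    return {"action": "execute"}

read_decision = guard({"name": "get_weather", "input": {"city": "Oslo"}})
denied = guard({"name": "send_email",
                "input": {"to": "bob@evil.test", "body": "hi"}})
preview = guard({"name": "send_email",
                 "input": {"to": "alice@example.com", "body": "hi"}},
                dry_run=True)
```

A deny or preview decision would be returned to the model as a tool result, in line with the error-handling pattern described earlier, so the agent can explain or adjust rather than stall.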