Autonomous AI Agents
Tool use, planning, memory, multi-step loops. What's hard about turning a language model into something that takes actions.
Use cases
An autonomous AI agent is a language model wrapped in a loop that calls tools, observes results, updates state, and decides the next action. Where a prompt-engineered system has one model call, an agent has many — sometimes a hundred — chained together until the task is done.
The cases where agents genuinely earn their complexity:
- Multi-step research tasks — gather data from N sources, reconcile it, write a report. Each step depends on the previous one’s output; you can’t precompute the plan.
- Software engineering loops — read code, edit a file, run tests, read failures, edit again. The exit condition is “tests pass”, not a fixed step count.
- Operational workflows with branching — triage a support ticket, look up customer history, attempt a refund, escalate on policy hit. The path differs per case.
- Browser and OS automation — drive a UI by reading screenshots and emitting clicks. The agent reacts to what’s on screen, not a recorded script.
- Data extraction across heterogeneous sources — pull the same fields from PDFs, web pages, and spreadsheets where each requires a different parsing strategy.
Agents are a poor fit when the workflow is well-known (just write the code), when latency matters more than autonomy (a chat reply with retrieval beats a 30-second agent), or when failure is unsafe and human-in-the-loop isn’t available.
System overview
A working agent has more boxes than the LLM call. The runtime, the tool registry, and the memory store are co-equal components:
```text
[User goal]
     │
     ▼
[Agent runtime]
     │
     ▼
┌──────────────────────────────────────┐
│ Loop until done or step-limit:       │
│                                      │
│   [Build prompt]                     │
│     - system: role, tools, policies  │
│     - memory: short-term scratchpad  │
│     - retrieved: long-term memory    │
│     - history: prior steps + results │
│          │                           │
│          ▼                           │
│   [LLM call: emit thought + action]  │
│          │                           │
│          ▼                           │
│   [Parse action]                     │
│     - tool name                      │
│     - tool arguments                 │
│          │                           │
│          ▼                           │
│   [Tool executor]                    │
│     - permission check               │
│     - rate / cost guard              │
│     - call external API or function  │
│          │                           │
│          ▼                           │
│   [Observation back to history]      │
│          │                           │
│          ▼                           │
│   [Termination check]                │
└──────────────────────────────────────┘
     │
     ▼
[Final answer + trace]
```

The runtime owns the loop, the budget, the tool registry, and the persistence. The LLM is the policy that decides one step at a time.
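The loop in the diagram can be sketched in a few lines. This is a minimal illustration, not a reference runtime: `run_agent`, `Step`, `AgentRun`, and the `scripted_policy` stand-in for the model call are hypothetical names, and the prompt-building and parsing stages are collapsed into the `policy` callable.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    thought: str
    tool: str
    args: dict
    observation: str


@dataclass
class AgentRun:
    goal: str
    history: list = field(default_factory=list)
    answer: Optional[str] = None


def run_agent(goal, policy, tools, max_steps=10):
    """Loop until the policy emits 'finish' or the step limit is hit."""
    run = AgentRun(goal=goal)
    for _ in range(max_steps):
        # The policy stands in for "build prompt + LLM call + parse action".
        thought, tool, args = policy(run)
        if tool == "finish":
            run.answer = args["answer"]
            break
        if tool not in tools:
            observation = f"error: unknown tool {tool!r}"
        else:
            observation = str(tools[tool](**args))
        run.history.append(Step(thought, tool, args, observation))
    return run


def scripted_policy(run):
    """Hypothetical stand-in for a model: look up once, then finish."""
    if not run.history:
        return "Need the order count.", "count_orders", {"customer_id": "C-9241"}
    return "Have the count.", "finish", {"answer": run.history[-1].observation}


TOOLS = {"count_orders": lambda customer_id: 3}
result = run_agent("How many orders does C-9241 have?", scripted_policy, TOOLS)
print(result.answer, len(result.history))  # prints: 3 1
```

If the policy never emits `finish`, the loop stops at `max_steps` and `answer` stays `None`, which is the step-limit exit shown in the diagram.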
Key components
The tool registry
Tools are the agent’s hands. Each tool has a name, a typed input schema, an output shape, and a description that the model reads at decision time.
Working practices:
- Keep tools narrow and named for verbs. get_customer_by_email beats query_database. The model picks tools by name far more reliably when the name encodes intent.
- Description matters more than code. The model never sees the implementation; it sees the description. Spend the same energy on it that you would on a public API doc.
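- Strict typed inputs. Use the provider’s tool-calling API (OpenAI, Anthropic, Gemini) so arguments arrive as structured data matched against your schema rather than as free text. How strictly the schema is enforced varies by provider and mode, so validate arguments in the runtime as well. Free-text arg parsing breaks weekly.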
- Idempotency where possible. Tools that retry safely without side effects are easier to wrap in a runtime that recovers from partial failures.
- Output truncation. A tool that returns 200 KB of HTML poisons the next prompt. Truncate at the boundary or return a handle the agent can deref later.
As a rule of thumb, most production agents work well with 8 to 25 tools. Past 50, model accuracy on tool selection tends to drop sharply, and the system prompt grows past comfortable context budgets.
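The practices above (verb-named tools, model-facing descriptions, typed inputs, bounded outputs) can be sketched as a small registry. Everything here is illustrative: `Tool`, `ToolRegistry`, and `MAX_OUTPUT_CHARS` are hypothetical names, and the type check is a simple stand-in for the schema enforcement a provider API would add.

```python
from dataclasses import dataclass
from typing import Callable

MAX_OUTPUT_CHARS = 2000  # assumed bound; tune to your context budget


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    params: dict  # argument name -> expected Python type
    fn: Callable


class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, tool):
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def describe(self):
        """Text the model reads at decision time: names, args, descriptions."""
        lines = []
        for t in self._tools.values():
            sig = ", ".join(f"{k}: {v.__name__}" for k, v in t.params.items())
            lines.append(f"{t.name}({sig}): {t.description}")
        return "\n".join(lines)

    def call(self, name, args):
        tool = self._tools[name]
        # Reject missing, extra, or wrongly typed arguments before running.
        if set(args) != set(tool.params):
            raise TypeError(f"{name} expects arguments {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(args[key], expected):
                raise TypeError(f"{name}.{key} must be {expected.__name__}")
        output = str(tool.fn(**args))
        # Truncate at the boundary so one large result cannot flood the prompt.
        if len(output) > MAX_OUTPUT_CHARS:
            output = output[:MAX_OUTPUT_CHARS] + " [truncated]"
        return output


registry = ToolRegistry()
registry.register(Tool(
    name="get_customer_by_email",
    description="Return the customer record for an exact email address.",
    params={"email": str},
    fn=lambda email: {"id": "C-9241", "email": email},
))
print(registry.describe())
print(registry.call("get_customer_by_email", {"email": "a@example.com"}))
```

A fuller version would return a handle for oversized outputs instead of cutting them, so the agent can fetch the rest later.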
Planning
The simplest agent is reactive: at each step, the model decides the next single action based on history. The ReAct pattern formalizes this as alternating “thought” and “action” tokens.
For deeper tasks, planners help:
- Up-front plan, then execute. The model writes a numbered plan, then the runtime executes each step. Cheap, but inflexible — when step 3 reveals new information that invalidates step 5, the agent doesn’t adapt.
- Plan-and-revise. After each step, ask the model whether the plan needs updating. More tokens, more reliable on long tasks.
- Hierarchical decomposition. A top-level agent assigns subtasks to specialist sub-agents (a “research agent”, a “code agent”). Each sub-agent has a narrower tool set and runs its own loop. Composes well; debugs poorly.
For tasks under 10 steps, reactive ReAct is usually fine. Past that, an explicit planner is worth its tokens.
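Plan-and-revise, as described above, amounts to re-asking for the remaining plan after every step. A minimal sketch, assuming hypothetical `make_plan`, `execute`, and `revise` callables that would each wrap a model or tool call in a real system:

```python
def plan_and_revise(goal, make_plan, execute, revise, max_steps=20):
    """Execute a plan step by step, letting `revise` rewrite the remainder."""
    remaining = list(make_plan(goal))
    done = []
    while remaining and len(done) < max_steps:
        step = remaining.pop(0)
        result = execute(step)
        done.append((step, result))
        # The reviser sees what happened and returns the new remaining plan.
        remaining = list(revise(goal, done, remaining))
    return done


# Hypothetical stand-ins for model and tool calls.
def make_plan(goal):
    return ["fetch report", "summarize report"]


def execute(step):
    return "report is paginated" if step == "fetch report" else "ok"


def revise(goal, done, remaining):
    _, last_result = done[-1]
    if last_result == "report is paginated" and "fetch remaining pages" not in remaining:
        return ["fetch remaining pages"] + remaining
    return remaining


trace = plan_and_revise("summarize the report", make_plan, execute, revise)
print([step for step, _ in trace])
# prints: ['fetch report', 'fetch remaining pages', 'summarize report']
```

Dropping the `revise` call turns this into the cheaper up-front plan, which would have skipped the extra pages.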
Memory
Agents have three kinds of memory, and conflating them is a common failure:
- Short-term scratchpad. The running history of thoughts, actions, and observations within one task. Lives in the prompt context. Bounded by context window.
- Working memory. Named variables the agent writes and reads during the task — “I stored the customer ID at step 2 and reuse it at step 7”. Implemented as a key-value store the agent can write to via a tool.
- Long-term memory. Persisted facts across sessions — user preferences, past resolutions, organizational knowledge. Stored in a vector index or structured store; retrieved at prompt-build time.
When the context window fills, the runtime compacts short-term memory: summarize older steps and replace them with the summary, or evict tool outputs while keeping thoughts. Aggressive compaction is the price of long-horizon tasks.
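One way to sketch the summarize-and-replace compaction described above. The assumptions are loud ones: history entries are plain strings, character counts stand in for token counts, and `summarize` is a hypothetical callable that would wrap a model call.

```python
def compact(history, max_chars, keep_recent, summarize):
    """Replace older steps with one summary entry when history is too long."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(len(entry) for entry in history)
    if total <= max_chars or len(history) <= keep_recent:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = f"[summary of {len(old)} earlier steps] {summarize(old)}"
    return [summary] + recent


history = [f"step {i}: " + "x" * 100 for i in range(6)]
compacted = compact(
    history,
    max_chars=300,
    keep_recent=2,
    summarize=lambda old: f"{len(old)} lookups completed",  # stand-in summarizer
)
print(len(compacted), compacted[0])
# prints: 3 [summary of 4 earlier steps] 4 lookups completed
```

The result is not guaranteed to fit under `max_chars` (the recent steps alone may exceed it), so a real runtime would also bound individual observations.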
The runtime
The runtime is the loop and the safety rails around it. What lives here, not in the model:
- Step budget. Hard cap on iterations. Without this, agents run forever on tasks they can’t solve.
- Token and cost budget. Per-task ceiling; abort and report when exceeded.
- Concurrency limits and rate limits. Per-tool, per-tenant, per-API.
- Permissioning. Which tools the agent can call, which arguments are allowed, which require human approval.
- Tracing and replay. Every step persisted with input, output, latency, cost. Without this, debugging an agent is impossible.
- Failure recovery. Retry on transient tool errors; surface persistent errors back to the model so it can try a different approach.
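Two of the items in the list above, the budgets and failure recovery, fit in a short sketch. `Budget`, `BudgetExceeded`, `TransientToolError`, and `call_with_retry` are hypothetical names; how a real runtime classifies an error as transient depends on the tools it wraps.

```python
import time


class BudgetExceeded(Exception):
    pass


class TransientToolError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, rate limits)."""


class Budget:
    def __init__(self, max_steps, max_cost_usd):
        self.max_steps, self.max_cost_usd = max_steps, max_cost_usd
        self.steps, self.cost_usd = 0, 0.0

    def charge(self, cost_usd):
        """Record one step and abort the task when either cap is exceeded."""
        self.steps += 1
        self.cost_usd += cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost limit ${self.max_cost_usd:.2f} exceeded")


def call_with_retry(fn, attempts=3, base_delay=0.0):
    """Retry transient failures; return persistent ones as an observation."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientToolError:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        except Exception as exc:
            # Persistent error: hand it to the model instead of crashing.
            return f"error: {exc}"
    return "error: tool unavailable after retries"


calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientToolError("timeout")
    return "ok"


print(call_with_retry(flaky), calls["n"])  # prints: ok 3
```

Returning the persistent error as a string, rather than raising, is what lets the model see the failure and try a different approach on its next step.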
Observation handling
Tool outputs arrive in many shapes — JSON, HTML, screenshots, audio. The runtime normalizes them before they hit the prompt:
- Truncate long text to a bounded window.
- Convert HTML to readable text (boilerplate stripped).
- Compress images to a model-readable resolution.
- Tag the observation with the tool name and step number so the model can reference it later ("as shown in step 3 output").
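The text-side normalization steps above can be sketched with the standard library alone. This is an illustration under assumptions: `normalize_observation` and its tag format are hypothetical, the HTML detection is a crude heuristic, and image compression is left out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def normalize_observation(raw, tool, step, max_chars=500):
    # Crude heuristic (an assumption): treat anything with angle brackets as HTML.
    text = html_to_text(raw) if "<" in raw and ">" in raw else raw
    if len(text) > max_chars:
        text = text[:max_chars] + " [truncated]"
    # Tag with step and tool so the model can refer back to this output.
    return f"[step {step} | {tool}] {text}"


page = "<html><style>p{}</style><p>Order <b>shipped</b></p></html>"
print(normalize_observation(page, "fetch_page", 3))
# prints: [step 3 | fetch_page] Order shipped
```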
Implementation patterns
ReAct (Reason + Act)
The canonical pattern. At each step the model emits:
```text
Thought: I need to find the customer's order history.
Action: get_orders(customer_id="C-9241")
```

The runtime parses the action, executes the tool, and appends the observation. The next step starts with the updated history. ReAct works because the explicit “thought” channel lets the model reason aloud before committing to an action, and the runtime can use those thoughts for tracing and debugging.
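Parsing that text format might look like the sketch below. It assumes the exact `Thought:` / `Action: name(key=value)` layout shown above with literal keyword arguments only; `parse_react` is a hypothetical helper, and text parsing of this kind is brittle compared with structured tool calls.

```python
import ast
import re

ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)
THOUGHT_RE = re.compile(r"^Thought:\s*(.*)$", re.MULTILINE)


def parse_react(text):
    """Return (thought, tool_name, kwargs) from one model turn."""
    action = ACTION_RE.search(text)
    if action is None:
        raise ValueError("no Action line found")
    thought = THOUGHT_RE.search(text)
    name, arg_src = action.group(1), action.group(2)
    # Parse the argument list as a Python call expression without running it.
    call = ast.parse(f"f({arg_src})", mode="eval").body
    if call.args:
        raise ValueError("only keyword arguments are supported")
    # literal_eval accepts only literals, so names and calls are rejected.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return (thought.group(1) if thought else ""), name, kwargs


turn = (
    "Thought: I need to find the customer's order history.\n"
    'Action: get_orders(customer_id="C-9241")'
)
print(parse_react(turn))
```

The runtime would then look up `get_orders` in its registry, call it with the parsed keyword arguments, and append the result to the history as the observation.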
Tool-use over JSON-mode
Modern provider APIs offer first-class tool calling: the model emits a structured tool call as part of its response, the provider matches it against the declared schema, and the runtime dispatches. This is more reliable than parsing tool calls out of free-form JSON in completions, because most malformed calls are caught at the API boundary rather than during execution. How strictly the schema is enforced depends on the provider and the mode you enable, so the runtime should still validate arguments before executing.
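To make the runtime side concrete, here is a sketch of a tool definition and a dispatcher for a structured call. The definition shape is provider-neutral and only loosely modeled on JSON Schema based tool definitions (field names differ between providers), `validate_args` covers a tiny subset of JSON Schema, and all names are hypothetical.

```python
import json

GET_ORDERS = {
    "name": "get_orders",
    "description": "List orders for a customer ID such as 'C-9241'.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["customer_id"],
    },
}

JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_args(schema, args):
    """Check a small subset of JSON Schema; return a list of problems."""
    problems = []
    props = schema["properties"]
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in props:
            problems.append(f"unexpected argument: {key}")
            continue
        type_name = props[key]["type"]
        ok = isinstance(value, JSON_TYPES[type_name])
        # bool is a subclass of int in Python, so reject it for other types.
        if isinstance(value, bool) and type_name != "boolean":
            ok = False
        if not ok:
            problems.append(f"{key} must be {type_name}")
    return problems


def dispatch(call_json, definition, fn):
    """Validate one structured call, then run it or return an error observation."""
    call = json.loads(call_json)
    if call["name"] != definition["name"]:
        return f"error: unknown tool {call['name']}"
    problems = validate_args(definition["parameters"], call["arguments"])
    if problems:
        return "error: " + "; ".join(problems)
    return fn(**call["arguments"])


def get_orders(customer_id, limit=10):
    return [customer_id, limit]  # stand-in for a real lookup


print(dispatch('{"name": "get_orders", "arguments": {"customer_id": "C-9241"}}',
               GET_ORDERS, get_orders))
print(dispatch('{"name": "get_orders", "arguments": {"customer_id": 42}}',
               GET_ORDERS, get_orders))
```

In production a full JSON Schema validator would likely replace `validate_args`; the point of the sketch is that malformed calls come back to the model as observations instead of reaching the tool.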
Human-in-the-loop for irreversible actions
Some tools must never run without explicit approval: sending money, deleting data, posting publicly. The pattern is a “propose” tool that emits the action for human review, plus an execute_approved tool that runs only after a sign-off arrives. The runtime gates the second tool behind the approval state.
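The propose / execute_approved split can be sketched as a small gate object. `ApprovalGate` and the `refund` executor are hypothetical; a real system would persist proposals and authenticate the approver, neither of which is shown.

```python
import uuid


class ApprovalGate:
    def __init__(self, executors):
        self._executors = executors  # action name -> callable
        self._pending = {}           # proposal id -> (action, args)
        self._approved = set()

    def propose(self, action, args):
        """Tool exposed to the agent: records the action, runs nothing."""
        if action not in self._executors:
            raise KeyError(f"unknown action: {action}")
        proposal_id = uuid.uuid4().hex
        self._pending[proposal_id] = (action, args)
        return proposal_id

    def approve(self, proposal_id):
        """Called by the human review channel, never by the agent."""
        if proposal_id not in self._pending:
            raise KeyError(f"unknown proposal: {proposal_id}")
        self._approved.add(proposal_id)

    def execute_approved(self, proposal_id):
        """Tool exposed to the agent: runs only after sign-off, and only once."""
        if proposal_id not in self._approved:
            raise PermissionError("proposal has not been approved")
        self._approved.discard(proposal_id)
        action, args = self._pending.pop(proposal_id)
        return self._executors[action](**args)


gate = ApprovalGate({
    "refund": lambda order_id, amount: f"refunded {amount} on {order_id}",
})
pid = gate.propose("refund", {"order_id": "O-1", "amount": 20})
gate.approve(pid)  # in practice triggered by a reviewer, not by code
print(gate.execute_approved(pid))  # prints: refunded 20 on O-1
```

Because the approval state lives in the runtime, the model cannot talk its way past the gate: calling `execute_approved` early raises, and a used approval cannot be replayed.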
Reflection
After the agent thinks it’s done, ask it to critique its own output before returning. The critique can catch missed steps and inconsistencies. The cost is a few extra calls; the benefit is bounded — reflection helps most when the failure modes are visible in the output, less when the agent never gathered the right information in the first place.
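A bounded critique-and-revise pass could be sketched like this. `answer_with_reflection` is a hypothetical helper, and the `critique` and `revise` functions are toy stand-ins for the extra model calls.

```python
def answer_with_reflection(draft, critique, revise, max_rounds=2):
    """Critique the draft and revise until no issues remain or rounds run out."""
    rounds = 0
    while rounds < max_rounds:
        issues = critique(draft)
        if not issues:
            break
        draft = revise(draft, issues)
        rounds += 1
    return draft, rounds


# Hypothetical stand-ins for the two extra model calls.
def critique(draft):
    return [] if "order" in draft.lower() else ["does not name the order"]


def revise(draft, issues):
    return draft + " This applies to order O-1."


final, rounds = answer_with_reflection("The refund was issued.", critique, revise)
print(rounds, final)
# prints: 1 The refund was issued. This applies to order O-1.
```

The `max_rounds` cap keeps the cost bounded, matching the point above that reflection only catches failures that are visible in the output.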
Multi-agent orchestration
Specialist agents under an orchestrator: a “researcher” gathers data, an “analyst” reasons over it, a “writer” produces output. Communication is via a shared scratchpad or explicit message passing. Works for genuinely separable tasks; adds latency and coordination cost. The orchestrator’s prompt grows fast; consider whether a single agent with sharper tools would be simpler.
Trade-offs
The honest answer is that most production “agents” are 70% hand-coded workflow with an LLM-driven branch or two. Pure autonomy is reserved for tasks where the path genuinely can’t be predicted (research, coding-with-tests, open-ended browsing).
Other axes:
- One large model vs. mixed model cascade. A single frontier model is the simplest. A cascade (a cheap model for tool-arg parsing, a frontier model for planning, a fast model for summarization) can cut cost by a factor of 3 to 5, at the price of added engineering complexity.
- Synchronous vs. streaming output. Streaming the agent’s thoughts to a UI is great UX but exposes incoherent intermediate states. Most products buffer the final answer and stream a status line (“Searching documents…”, “Drafting reply…”) instead.
- Persistent vs. ephemeral memory. Persistent memory makes agents helpful across sessions but raises privacy, ACL, and freshness questions. Ephemeral memory is simpler and safer; many products start there and add persistence per-feature when value is proven.
- Provider tool-calling vs. open-weights with custom parsers. Providers handle schema enforcement; open-weights stacks need grammar-constrained decoding or careful parsing. Open weights are catching up — modern open models with strict JSON modes do tool calling well — but the polish gap is real.
Quality and evaluation
Evaluating agents is harder than evaluating prompts because the answer depends on a sequence of choices. The eval harness has at least four layers:
- Per-tool unit tests. For every tool, an isolated test set: given input X, does the tool return the right thing? Tool bugs masquerade as agent bugs constantly.
- Per-step correctness on traces. Replay a logged trace; at each step, did the agent pick a reasonable action? Often graded by an LLM-as-judge with a rubric.
- End-to-end task success. Given a goal, does the agent finish with a correct answer? Pass / fail per task across a frozen set of representative tasks. The single most important metric.
- Cost and step distribution. Median and p95 steps per task, cost per task, tool-call counts. A regression that doubles steps without changing pass rate is still a regression.
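The end-to-end and distribution metrics from the list above reduce to a few aggregates over logged runs. A minimal sketch, assuming each run is a dict with hypothetical keys `passed`, `steps`, and `cost_usd`; the sample runs and the 1.25 step-ratio threshold are illustrative, not recommendations.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize_runs(runs):
    """Aggregate pass rate, step distribution, and cost over a task set."""
    steps = [r["steps"] for r in runs]
    return {
        "pass_rate": sum(r["passed"] for r in runs) / len(runs),
        "median_steps": median(steps),
        "p95_steps": percentile(steps, 95),
        "mean_cost_usd": sum(r["cost_usd"] for r in runs) / len(runs),
    }


def is_regression(baseline, candidate, max_step_ratio=1.25):
    """Flag a lower pass rate, or the same pass rate bought with many more steps."""
    return (
        candidate["pass_rate"] < baseline["pass_rate"]
        or candidate["median_steps"] > max_step_ratio * baseline["median_steps"]
    )


# Illustrative runs, not real measurements.
runs = [
    {"passed": True, "steps": 4, "cost_usd": 0.25},
    {"passed": True, "steps": 5, "cost_usd": 0.25},
    {"passed": True, "steps": 6, "cost_usd": 0.5},
    {"passed": False, "steps": 7, "cost_usd": 0.5},
    {"passed": True, "steps": 30, "cost_usd": 1.0},
]
summary = summarize_runs(runs)
print(summary)
print(is_regression(summary, dict(summary, median_steps=12)))  # prints: True
```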
Some agents need a “judging environment” — a sandbox that exposes the same tools as production but can be reset between runs. For coding agents, this is a test suite. For browser agents, this is a stub of the target site. For research agents, this is a frozen snapshot of the corpus they search.
Tracing is the prerequisite for all of this. Tools like LangSmith, Langfuse, Phoenix, Weights & Biases Traces, and provider-native trace UIs persist every step with timing, cost, and IO. A team without tracing is debugging blind.
Common pitfalls
- No step limit. The agent gets stuck and runs until the bill arrives. Always cap.
- Tool descriptions copied from internal documentation. Internal docs assume context the model doesn’t have. Rewrite descriptions for the model’s perspective: what the tool does, what inputs it expects, what it returns, what error cases mean.
- Letting tool outputs into context unbounded. A scrape of a 50 KB page lands in the next prompt and crowds out the system message. Truncate, summarize, or stash and reference by handle.
- Treating non-determinism as a bug. Two runs of the same task may take different paths. Asserting byte-exact equality across runs is wrong; assert behavioral equivalence (same final answer, same outcomes on side effects).
- Mixing read and write tools casually. Writes have side effects. Read tools should be liberally available; writes should be gated by permission, budget, and approval flows.
- Letting the model alone decide when to stop. Models tend to be optimistic and often declare completion before the task is done. Combine a model-emitted “done” signal with an external check (tests passing, schema validation, human review).
- Skipping the trace UI. “Logs are enough” is wrong for agents. You need step-by-step replay with timing and IO to make sense of a 30-step trace. Build the UI early.
- One huge system prompt for all tasks. A 3 KB system prompt that lists every tool and every policy is a quality killer. Route by task type; load only the tools and policies the current task needs.
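The stopping pitfall in the list above has a compact guard: accept the model's “done” signal only when external checks agree. A minimal sketch with hypothetical names; the checks here are toy field validations standing in for a test suite or schema validator.

```python
def is_finished(model_says_done, checks):
    """Accept completion only if the model claims it and every check passes."""
    if not model_says_done:
        return False, []
    failures = [name for name, check in checks.items() if not check()]
    return (not failures), failures


answer = {"customer_id": "C-9241", "refund_status": None}
checks = {
    "has_customer_id": lambda: bool(answer.get("customer_id")),
    "has_refund_status": lambda: answer.get("refund_status") is not None,
}

print(is_finished(True, checks))  # prints: (False, ['has_refund_status'])
answer["refund_status"] = "issued"
print(is_finished(True, checks))  # prints: (True, [])
```

Feeding the failure names back to the model as an observation gives it something specific to fix, instead of letting it stop early or loop blindly.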
Related applications
For the deeper engineering view on agents as products — runtime architecture, evaluation harnesses, safety boundaries, and the day-2 operations stack — see the Agentic workbook. The Gen AI workbook treats agents as one application; the Agentic workbook treats them as a system to be engineered end-to-end.
The minimum viable agent runtime in 80 lines
A working agent runtime is shorter than people expect. The hard parts aren’t the loop. The hard parts are: a tool registry with typed schemas, a tracer that captures every call, a budget guard that aborts when costs exceed a cap, a context compactor that summarizes old steps when the window fills, and a permission layer that gates dangerous tools. If you can stand up those five pieces, you have a runtime. Everything beyond that — multi-agent orchestration, planners, reflection — is a pattern that runs on top, not a primitive you need on day one. The teams shipping agents in production today are mostly running 200-line runtimes around frontier-model tool calls. The complexity lives in the tool implementations and the eval harness, not the runtime.