Agent Architecture Overview
The four-component reference architecture: model, tools, memory, instructions. How requests flow through each.
Summary#
Almost every production agent — from a coding assistant to a multi-step research bot — reduces to the same four-part reference architecture: a model that reasons, a tool surface that acts, a memory that persists state, and an instruction set that constrains behaviour. A runtime loop ties them together: the model reads memory plus the latest observation, picks a tool (or chooses to stop), the runtime executes that tool and writes the result back into memory, and the loop iterates until an exit condition fires.
The picture is deliberately boring. The interesting variation is in how each piece is wired — what kind of memory, which tools, how strict the instructions, what triggers the exit. Master that four-part decomposition and most agent designs become legible: you can point at any system and label its model, tools, memory, and instructions, and the failure modes follow.
Why it matters#
A clean architecture is the difference between an agent you can debug and one you can only re-roll. Three concrete reasons it pays off:
- It localises failures. If outputs are wrong, you instrument the model. If the wrong tool fires, you look at the tool schema and the instruction set. If the agent forgets a critical fact between steps, the memory layer is the suspect. Without the decomposition, every bug is “the LLM did something weird”.
- It guides cost and latency optimisation. Cheap model + many tool calls is a very different cost shape than expensive model + few tool calls. Knowing the components lets you pick where to spend.
- It transfers across frameworks. LangChain, AutoGen, LangGraph, Google ADK, custom code — they all assemble the same four pieces. Once you can read an agent in these terms, switching frameworks is a syntax exercise.
It also makes interview answers crisper. “Walk me through your agent” becomes a 30-second tour of the four components plus the loop, instead of a ramble through the prompt.
How it works#
The four components, one more time#
- Model. The LLM doing the reasoning step. Inputs: the system prompt, the conversation/loop history, the tool descriptions, the current observation. Output: either a final answer or a structured tool call. Choice of model fixes capability ceiling, latency floor, and cost shape. Most agents use a single model; some route between a cheap model for routine steps and a stronger model for hard ones.
- Tools. Typed functions the model can invoke. Each tool has a name, a description, and an input schema (JSON Schema, typically). The runtime — not the model — actually executes the tool and produces the observation. Tools are the only way the agent affects the outside world: read files, query APIs, send mail, run code. The tool surface defines the agent’s verbs.
- Memory. The state the agent carries forward. Short-term: the running context window (system prompt, instructions, history, latest observation). Long-term: an external store the agent can read and write across sessions — vector indexes, key-value stores, files, databases. Memory choices drive how much context the model sees and how much it has to re-derive every step.
- Instructions. The system prompt plus in-context examples plus runtime policies. Defines persona, output format, refusal rules, escalation triggers. Instructions don’t do anything at runtime — they shape every decision the model makes inside the loop.
Request flow through the architecture#
Walk through a single agent step end-to-end:
- Runtime assembles input. It pulls the instruction set, optional retrieved long-term memory, the conversation/loop history (short-term memory), and the latest user message or tool observation. It also attaches the tool catalogue — names, descriptions, JSON schemas.
- Model reasons. The LLM consumes that input and emits a structured response: either content for the user or a tool call (or both — many APIs allow parallel tool calls in one response).
- Runtime dispatches. If a tool call was emitted, the runtime validates arguments against the schema, executes the tool, and captures the result.
- Runtime updates memory. The tool result becomes the next observation; short-term memory grows; the runtime may also write to long-term memory (cache, episodic store, vector index).
- Loop or exit. The runtime checks exit conditions — final answer emitted, step budget hit, success predicate satisfied, escalation required. If none fire, go to step 1.
Where the seams are#
- Between model and tools sits the tool-calling protocol — function calling, JSON mode, structured outputs. Brittle when the model misformats; the runtime needs retries, schema repair, or strict decoding.
- Between model and memory sits the context-window manager — what gets included, what gets summarised, what gets dropped. The hardest seam to get right at long horizons.
- Between tools and the world sits the permission boundary — who can run what, with what credentials, against what blast radius. The single most important safety seam.
- Between instructions and everything else sits the prompt template — the runtime artefact that stitches role, rules, examples, and history into the final model input.
Variants and trade-offs#
Other axes that shape the architecture:
- Stateless vs stateful runtime. Stateless agents re-derive context from durable storage on every call (easier to scale horizontally, harder to maintain working memory). Stateful agents keep an in-process loop alive (cheaper per step, but pinned to a single worker — pod restarts are catastrophic).
- Single-tool-call vs multi-tool-call per step. Parallel tool calls (supported by modern function-calling APIs) cut wall-clock latency when the calls are independent. They also expand the blast radius of a single bad decision — four wrong calls instead of one.
- Code-execution backbone vs typed-tool backbone. A code-executor lets the model write Python/JS and run it in a sandbox; a typed-tool agent only invokes pre-registered functions. Code is more flexible and often more capable on novel tasks; typed tools are easier to audit, gate, and reason about.
- Synchronous vs asynchronous orchestration. Sync agents block the caller until done — fine for short loops, fatal for long ones. Async agents (job IDs, callbacks, streamed progress) are the right shape for anything over ~30 seconds; the architecture changes shape around queues and durable stores.
A few real systems mapped onto the four components
- ReAct-style coding agent. Model: a strong reasoning LLM. Tools: shell, file read/write, search. Memory: ephemeral chat history + a
scratchpad.mdfile. Instructions: “you are a senior engineer; verify before writing; ask before destructive ops.” - WebVoyager. Model: multimodal LLM. Tools: screenshot, click(x, y), type, scroll, navigate. Memory: page snapshots stack + action history. Instructions: web-navigation persona plus task-specific goal.
- MACRS. Model: shared LLM across roles. Tools: candidate retrieval, ranking, reply generator. Memory: dialogue state plus user-preference vectors. Instructions: per-agent role prompts (planner, critic, reflector).
Different domains, same four slots filled differently. That’s the whole architectural payoff.
When this is asked in interviews#
Less often as a standalone question, more often as the frame for any agent design problem. The interviewer is looking for whether you can decompose without prompting.
- AI-platform design loops. “Design an agent for X” — the answer always starts with the four components and the loop. Skip the decomposition and the rest of the answer reads as ad-hoc.
- Senior backend / staff loops. “How would you structure an agentic service in our system?” — they want the architecture diagram: API surface, runtime, model layer, tool layer, memory layer, observability seams.
- AI-product engineering loops. “Walk me through an agent you’ve shipped” — strong candidates point at the four slots, name what’s in each, and explain the loop and exit conditions in the same breath.
Common follow-ups to be ready for:
- “Where would you put the prompt template?” — between instructions and the rest, owned by the runtime, versioned alongside code.
- “What does long-term memory cost you?” — extra latency per step, eventual-consistency bugs, cache-staleness incidents, and a new ops surface (vector store, KV store).
- “Where would caching live in your architecture?” — three places: prompt-prefix caching at the model layer, tool-result caching at the runtime, retrieval caching at the memory layer.
- “How do you migrate to a stronger model?” — separation of concerns; the model is one of four slots, swap it and re-evaluate; the rest of the architecture shouldn’t need to move.
A trap to avoid: do not collapse the four into “the LLM and its prompt”. The reason the architecture is four boxes is precisely so that prompt-engineering is one knob among four. Candidates who treat the whole agent as “more prompt” tend to under-invest in tooling, memory, and observability.
Related concepts#