Agent Memory — Agentic · Engineering Playbook

Summary

An agent’s memory is everything it can recall when reasoning. Two big buckets: short-term memory (the live context window — system prompt, instructions, conversation history, scratchpad, latest observation) and long-term memory (anything persisted outside the context window — vector indexes, key-value stores, files, databases, fine-tuned weights).

Inside long-term, three flavours borrowed from cognitive science map cleanly onto engineering choices: episodic (what happened — past trajectories, conversation transcripts), semantic (what is true — facts, profiles, knowledge), and procedural (how to do things — learned tools, scripts, workflows). Most production agents use some mix of all three, plus a deliberate strategy for compacting short-term memory so the loop doesn’t hit the context-window ceiling.

Why it matters

Memory is the most under-designed component of most agents. Three things go wrong when you ignore it:

The agent forgets the goal. Long loops drift because the original task description scrolls out of the window or gets summarised away. The model still reasons, but about the wrong problem.
Cost explodes silently. Every step re-sends the entire history, so a 20-step trajectory bills the shared prefix 20 times and each later step also re-bills everything that came before it. Without prompt caching or compaction, total input tokens grow quadratically with the number of steps.
Personalisation is impossible. “Remember I prefer concise answers” is a memory request, not a model request. Without a long-term store keyed by user/session, every conversation starts cold.
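To make the cost point concrete, here is a minimal arithmetic sketch. It assumes a fixed prefix and a constant number of tokens added per step; both figures are hypothetical and real trajectories vary.

```python
def total_input_tokens(prefix_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Sum the input tokens billed across a loop that re-sends the full history."""
    total = 0
    for step in range(1, steps + 1):
        # Call number `step` sends the fixed prefix plus every earlier step's output.
        total += prefix_tokens + tokens_per_step * (step - 1)
    return total


# Hypothetical numbers: a 2,000-token prefix and 500 tokens added per step.
cost_20 = total_input_tokens(2_000, 500, 20)  # 135,000 tokens
cost_40 = total_input_tokens(2_000, 500, 40)  # 470,000 tokens
print(cost_20, cost_40)
```

Doubling the number of steps from 20 to 40 raises billed input tokens by roughly 3.5x under these assumptions, which is the quadratic term showing up.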

Done well, memory turns a one-shot agent into a system that learns user preferences, reuses successful trajectories, and stays grounded over long horizons. Done badly, memory is either the silent cost driver or the silent correctness bug.

How it works

Short-term memory: the context window

The context window is the agent’s working memory. Mechanically, it’s just the input you send to the model on each call. Practically, four things live in it:

System / instruction block — persona, rules, tool catalogue.
Task / goal description — what the user asked for.
Trajectory — past steps (model thoughts, tool calls, tool results) ordered chronologically.
Latest observation — the result of the most recent tool call.

As the loop runs, the trajectory grows. Three management strategies show up everywhere:

Sliding window. Keep the most recent N steps; drop older ones. Cheap, simple, lossy.
Summarisation. When the trajectory gets long, call the model to compress earlier steps into a paragraph. Adds latency, preserves more information, can hallucinate during compression.
Structured scratchpad. Maintain a separate JSON/markdown blob the agent updates each step (goals, decisions, open questions). Stays compact even as the loop grows; requires discipline from the agent or runtime to update.
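The first two strategies are often combined: keep the most recent steps verbatim and replace everything older with a summary. The sketch below is one way to do that. The `count_tokens` helper is a crude word count rather than a real tokenizer, and `summarise` is a caller-supplied function that would be a model call in practice; both are assumptions for illustration.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace word count, not a real tokenizer.
    return len(text.split())


def compact(
    trajectory: List[str],
    keep_last: int,
    max_tokens: int,
    summarise: Callable[[List[str]], str],
) -> List[str]:
    """Return the trajectory unchanged if under budget, else summary + recent steps."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if sum(count_tokens(step) for step in trajectory) <= max_tokens:
        return trajectory
    older, recent = trajectory[:-keep_last], trajectory[-keep_last:]
    if not older:
        # Nothing older than the window to compress.
        return trajectory
    return [summarise(older)] + recent


def fake_summarise(steps: List[str]) -> str:
    # Placeholder for a model call that compresses earlier steps.
    return f"[summary of {len(steps)} earlier steps]"


trajectory = [f"step {i} did something useful" for i in range(10)]
compacted = compact(trajectory, keep_last=3, max_tokens=20, summarise=fake_summarise)
print(compacted)
```

Swapping `fake_summarise` for a function that returns a fixed note such as "earlier steps dropped" gives a plain sliding window.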

Many model APIs add prompt caching as a fourth lever: prefix-stable parts of the context (system prompt, tool catalogue) are cached server-side, and cache hits are billed at a fraction of the normal input price. This makes “send the whole history every step” cheaper but does not raise the window-size ceiling; at some point you still need compaction.

Long-term memory: the three flavours

Episodic, semantic, procedural — same taxonomy a psych textbook uses, applied to agents:

Episodic. Records of specific past events. “Last Tuesday the user asked for X and the agent did Y and it worked.” Implemented as transcript logs, trajectory traces, or vector-indexed past sessions. Read at runtime when a current situation looks similar to a past one.
Semantic. Generalised facts. “The user works in PST.” “The schema for the orders table has these columns.” Implemented as key-value stores, profile rows, knowledge graphs, or vector indexes over fact corpora. Read whenever the agent needs background context that isn’t task-specific.
Procedural. How-to knowledge. “When asked to deploy, run these commands in this order.” Implemented as saved tool sequences, prompt-templates-as-skills, or even fine-tuned model adapters. Read when the current task matches a known procedure.

Reads and writes

Reads are usually retrieval-augmented: the runtime issues a query (embedding similarity, SQL lookup, graph traversal) against long-term memory and injects the top-k results into the next prompt. Writes can be explicit and synchronous (the agent calls a remember() tool) or passive and asynchronous (the runtime logs every turn and a background job indexes them).
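A minimal sketch of the read and write paths, assuming an in-memory store. The `embed` function here is a toy bag-of-words counter standing in for a real embedding model, and `MemoryStore`, `remember`, and `recall` are illustrative names rather than any library's API.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class MemoryStore:
    def __init__(self) -> None:
        self._entries: List[Tuple[str, Counter]] = []

    def remember(self, text: str) -> None:
        """Explicit write: the agent calls this as a tool."""
        self._entries.append((text, embed(text)))

    def recall(self, query: str, k: int = 2) -> List[str]:
        """Read: rank stored entries by similarity and return the top k."""
        q = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = MemoryStore()
store.remember("user prefers concise answers")
store.remember("user works in the PST timezone")
store.remember("orders table has columns id, total, created_at")

# Retrieve, then inject the result into the next prompt.
context = store.recall("what timezone is the user in", k=1)
prompt = "Relevant memory:\n" + "\n".join(context) + "\n\nUser: schedule a call"
print(prompt)
```

A passive write path would call `remember` from a background indexing job instead of from the agent loop; the read side stays the same.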

The interesting design question is what to remember. Logging everything is cheap but produces noisy retrievals. Logging only “high-signal” turns (explicit user preferences, successful task completions) keeps the index clean but requires a classifier or heuristic. Most production systems land somewhere in the middle: log everything to cold storage, index a curated subset.
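The middle-ground policy can be sketched as a writer that appends every turn to a cold log and indexes only turns that pass a filter. The regex below is a hypothetical heuristic for spotting stated preferences, and the `task_succeeded` flag is assumed to be set by the runtime; a production system might use a classifier instead.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical heuristic: phrases that often signal an explicit preference.
PREFERENCE_PATTERN = re.compile(
    r"\b(i prefer|remember that|from now on)\b", re.IGNORECASE
)


@dataclass
class Turn:
    text: str
    task_succeeded: bool = False  # assumed to be set by the runtime


@dataclass
class MemoryWriter:
    cold_log: List[Turn] = field(default_factory=list)  # every turn
    index: List[Turn] = field(default_factory=list)  # curated subset

    def write(self, turn: Turn) -> None:
        # Everything goes to cold storage; only high-signal turns are indexed.
        self.cold_log.append(turn)
        if turn.task_succeeded or PREFERENCE_PATTERN.search(turn.text):
            self.index.append(turn)


writer = MemoryWriter()
writer.write(Turn("Remember that I prefer concise answers"))
writer.write(Turn("ok thanks"))
writer.write(Turn("Deployed the service", task_succeeded=True))
print(len(writer.cold_log), len(writer.index))
```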

Variants and trade-offs

Pure in-context — everything the agent knows fits in the window. No external store. Simple, deterministic, easy to evaluate. Limits: hard ceiling on session length, cold every new session, no personalisation across sessions.

Context + long-term store — short-term in the window, episodic/semantic/procedural in external systems. Scales to indefinite horizons, supports personalisation. Costs: retrieval-quality bugs, staleness, an extra ops surface (vector DB, eval pipeline), nondeterminism in what gets recalled.

Other dimensions worth thinking about:

Vector index vs structured store. Vectors are great for fuzzy semantic retrieval; structured stores (Postgres rows, KV) are great when the lookup key is exact. Real systems mix both — vectors for “find me similar past chats”, structured for “what is this user’s timezone”.
User-scoped vs global memory. User-scoped memory is the safe default — each user has their own slice. Global memory (facts about the product, shared knowledge) is shared. Mixing without clear boundaries is how PII leaks between users.
Write-on-summarise vs write-as-you-go. Some agents summarise the session at the end and write the summary to long-term memory; others write per-turn. Per-turn writes lose less when a session crashes midway; end-of-session writes are easier to keep clean.
Trainable vs prompt-based procedural memory. A “skill” can be a piece of text the agent retrieves and pastes into its prompt, or it can be a LoRA adapter loaded for that task. Prompt-based is easy to update; weight-based spends no prompt tokens on the skill and is more robust to prompt drift, but changing it means retraining.
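The user-scoped versus global boundary can be enforced structurally rather than by convention. In this sketch, which uses plain dictionaries as a stand-in for real storage, each user's entries live in their own partition and global facts live in a separate list, so a read for one user never scans another user's data. The class and method names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


class ScopedMemory:
    """Per-user entries live in separate partitions; global facts live apart."""

    def __init__(self) -> None:
        self._by_user: Dict[str, List[str]] = defaultdict(list)
        self._global: List[str] = []

    def write_user(self, user_id: str, entry: str) -> None:
        self._by_user[user_id].append(entry)

    def write_global(self, entry: str) -> None:
        self._global.append(entry)

    def read(self, user_id: str) -> List[str]:
        # Only this user's partition is touched, so there is no cross-user
        # scan where a forgotten filter could leak data.
        return list(self._by_user.get(user_id, [])) + list(self._global)


memory = ScopedMemory()
memory.write_user("alice", "alice lives in Lisbon")
memory.write_user("bob", "bob's card ends in 4242")
memory.write_global("Refunds take 5 days")
print(memory.read("alice"))
```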

A practical layout for a personal-assistant agent's memory

Short-term: a 16k-token sliding window with summarisation when the trajectory exceeds 8k. A structured JSON scratchpad tracks goal, current_plan, open_questions.
Episodic: every conversation logged to object storage; an embedding index over conversation summaries, queried at the start of each new session with the user’s first message.
Semantic: a key-value user_profile keyed by user_id, holding stable preferences (timezone, communication style, recurring projects). Updated only when the user explicitly states a new preference.
Procedural: a small library of saved prompt-snippets for common tasks (“draft a weekly status”, “summarise this thread”). The agent picks one by embedding similarity to the current task description.

Total moving parts: one vector DB, one KV store, one object store, one scratchpad in the loop. Not glamorous; works well in practice.
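Written down as configuration, the layout might look like the sketch below. The token thresholds and scratchpad fields come from the description above; the backend labels are placeholders, and putting the procedural snippets in the same vector DB as the episodic index is an assumption made to match the single vector DB in the parts count.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class MemoryLayout:
    # Short-term memory
    window_tokens: int = 16_000
    summarise_above_tokens: int = 8_000
    scratchpad_fields: Tuple[str, ...] = ("goal", "current_plan", "open_questions")
    # Long-term memory: backend labels are placeholders, not product names.
    episodic_log: str = "object_store"
    episodic_index: str = "vector_db"
    semantic_profile: str = "kv_store"
    procedural_snippets: str = "vector_db"  # assumed to share the vector DB

    def should_summarise(self, trajectory_tokens: int) -> bool:
        return trajectory_tokens > self.summarise_above_tokens

    def backends(self) -> Set[str]:
        """Distinct external systems the layout depends on."""
        return {
            self.episodic_log,
            self.episodic_index,
            self.semantic_profile,
            self.procedural_snippets,
        }


layout = MemoryLayout()
print(sorted(layout.backends()), layout.should_summarise(9_500))
```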

When this is asked in interviews

Memory questions are the easiest way to filter “has built an agent” from “has read about agents”. The naive answer is “we use a vector database”; the right answer separates short-term from long-term and is opinionated about what goes in each.

AI-product / AI-platform loops. “How does your agent remember things?” — strong answer covers short-term context management and long-term storage strategy, with concrete examples of what is stored where.
Senior backend loops. “Where does state live in your agent runtime?” — short-term in the loop process or a Redis-backed session, long-term in durable stores. Ops implications: scaling, replication, retention.
ML engineering loops. “How do you evaluate memory quality?” — recall on a held-out QA set against the stored memory, retrieval precision/recall, end-to-end task success with and without memory.
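Retrieval precision and recall, mentioned in the last answer, can be computed per query as follows. This is a minimal sketch that assumes each query has a hand-labelled set of relevant memory IDs and that the retriever returns IDs without duplicates; the IDs are made up.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(
    retrieved: List[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision and recall over the top-k retrieved memory IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the retriever returned four IDs, three memories are relevant.
retrieved_ids = ["m1", "m7", "m3", "m9"]
relevant_ids = {"m1", "m3", "m5"}
precision, recall = precision_recall_at_k(retrieved_ids, relevant_ids, k=2)
print(precision, recall)
```

Averaging these over a held-out query set gives the retrieval numbers; end-to-end task success with and without memory still has to be measured separately.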

Common follow-ups:

“How do you handle stale memory?” — TTL on entries, decay weights in retrieval, explicit invalidation when the agent observes a contradiction.
“Won’t a 1M-token model just solve this?” — partially, for short-to-medium horizons. Doesn’t solve cross-session memory, doesn’t solve cost (you still pay for the context), doesn’t solve retrieval over a corpus larger than the window.
“How do you keep one user’s memory from leaking into another’s?” — strict scoping at the storage layer; never share embedding indexes across users for episodic memory; treat user_id as a partition key, not a filter.
“What’s the failure mode you’ve seen most?” — silent drift: the agent acts on a memory the user has since updated; mitigated by recency-weighted retrieval and explicit re-confirmation on high-stakes actions.
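The TTL and decay-weight ideas from the stale-memory answer can be sketched as a re-ranking step. It assumes the retriever already supplies a similarity score in [0, 1] and that each entry records its age; the exponential half-life form and the specific numbers are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    text: str
    similarity: float  # from the retriever, assumed to be in [0, 1]
    age_days: float


def rank_with_decay(
    entries: List[Entry], ttl_days: float, half_life_days: float
) -> List[Entry]:
    """Drop entries past their TTL, then rank by similarity weighted by recency."""
    live = [e for e in entries if e.age_days <= ttl_days]

    def score(entry: Entry) -> float:
        # The weight halves every `half_life_days`.
        return entry.similarity * 0.5 ** (entry.age_days / half_life_days)

    return sorted(live, key=score, reverse=True)


entries = [
    Entry("user prefers email", similarity=0.9, age_days=60),
    Entry("user prefers Slack", similarity=0.8, age_days=2),
    Entry("user prefers fax", similarity=0.95, age_days=400),
]
ranked = rank_with_decay(entries, ttl_days=365, half_life_days=30)
print([e.text for e in ranked])
```

Here the newer Slack preference outranks the older email one despite a lower raw similarity, and the 400-day-old entry is dropped by the TTL.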