Context Window and Token Budget — Claude Code

Summary#

The context window is the model’s working memory: a finite token budget that holds the system prompt, the tool definitions, your messages, the model’s previous turns, and every tool result so far. It grows monotonically within a session and resets when the session ends. The window is generous — current Claude models offer hundreds of thousands of tokens — but it is not infinite, and on a long session the question is never “will it fit” but “will it fit and still leave room for the next useful thing.”

This is the single most important operational concept to understand if you want long sessions to stay coherent and short sessions to stay cheap. Almost every “Claude got worse mid-session” complaint is actually a context-window story.

Why it matters#

Context is where Claude’s competence lives. The model is stateless between API calls; everything it knows about your repo, your conversation, and your intent is reconstructed each turn from what’s in the window. When the window fills up, three bad things happen:

Compaction kicks in. Older turns get summarised to free space. The summary is good, but it is lossy — small details (a file path you mentioned, a function signature it read four turns ago) can vanish.
The model gets less careful. A near-full window leaves little room for the model’s own planning tokens. Plans get shorter, tool calls less considered, edits more error-prone.
Cost per turn climbs. Input tokens are billed every turn; a turn at 150k input tokens costs much more than the same turn at 30k. Long sessions in the wrong shape are surprisingly expensive.

Conversely, a session that respects the budget — reading only the files it needs, compacting deliberately, using sub-agents for research — stays coherent for hours and stays cheap doing it. The budget is a tool, not a constraint to fight.

How it works#

What lives in context#

Every turn, the model sees a structured input made up of:

System prompt. Anthropic-defined and Claude-Code-defined instructions. Includes tool definitions, behaviour rules, the active permission mode.
CLAUDE.md content. User-global, project, and subdirectory files concatenated in. Loaded once at session start; persistent for the session.
Memory index. MEMORY.md is always loaded; individual memory entries are pulled in on demand.
Conversation history. Every prior user message, every prior assistant turn (planning text plus tool calls), every tool result.
The current user message. Your latest prompt.

Tool results are the largest contributor in practice. A single Read of a 2,000-line file might be 15k tokens; a grep across the repo with full match context can be similar. The model’s own output is comparatively small — a paragraph of prose and a handful of tool calls per turn.

What gets summarised#

When context approaches the model’s limit (typically the last 20-30% of the window), Claude Code triggers automatic compaction. The mechanic is roughly:

Identify older turns that are no longer load-bearing (early reads of files that have since been edited, exploratory greps that have been resolved).
Replace them with a short summary the model writes itself: “Read these files, found X, made the following decisions, edited Y.”
Keep recent turns (last few exchanges) verbatim because they carry live state.

You’ll see a brief “compacting context” indicator when this happens. It is mostly silent and mostly correct. The cost is that fine-grained details from compacted turns are gone — the summary may say “edited the auth middleware to add request logging” but won’t remember the exact line numbers.

You can also compact deliberately with /compact — useful before a major scope change in the same session.

What falls out#

If compaction doesn’t free enough — usually because recent turns are themselves enormous (a 50k-token Bash output, say) — older content is truncated rather than summarised. The model is told “earlier history was truncated” and proceeds. This is the failure mode where the model “forgets what we agreed on three turns ago.” The fix is /clear (start fresh) or /compact (force a summary pass before more work).

Per-model budgets#

Different Claude models in the Claude Code lineup expose different context limits. Practically:

Opus / Sonnet ship with large windows — hundreds of thousands of tokens. Enough for several files of a medium repo plus a long conversation.
Haiku has a smaller but still substantial window. Fine for the mechanical-task niche; not the model to use for “read this whole module and refactor it.”

The budget is per-session, not per-account, and resets when you start a new conversation. There is no carry-over between sessions other than CLAUDE.md and memory.

Reading the budget#

The status line at the bottom of the CLI shows token usage as a percentage and absolute count when you opt into it via the custom status-line feature. The values you’ll see in a healthy session: a green percentage for the first half of the conversation, an amber band starting around 60-70%, red as you cross 85%. Compacting deliberately around the amber boundary is a cheap habit.

/cost shows the running token and dollar cost of the session. Useful for staying calibrated on what a typical session costs you.

Variants and trade-offs#

Stuff-it-in strategy — read many files at session start, let Claude have it all in context, work from there. Best for small repos and short tasks where the cost of reading is paid once. Falls apart on a 500-file codebase where you’d blow the window on the read.

Read-on-demand strategy — only read files when a turn needs them, lean on Grep / Glob for orientation, delegate broad searches to Explore sub-agents whose context is separate from yours. Slightly higher per-turn cost; vastly better at scale. The default for large repos.

For most engineers on real codebases, read-on-demand is the right default. The exception is a tight session on a small module — say five files in a feature directory — where reading them all upfront and then doing five turns of work is clearly cheaper than reading them five times.

The other axis is when to compact:

Compact early, often. Run /compact when usage hits 50-60%. Keeps headroom comfortable. Sacrifices some recall of older turns for predictable behaviour. Right for long sessions where you don’t need verbatim history.
Compact late, rarely. Let the model run until it auto-compacts at 80-85%. Maximises recall; risks one bad compaction wiping out something you needed. Right for short sessions where everything matters.

A pragmatic default: don’t manually compact unless you can name what would be wasted to keep. Auto-compaction is usually correct; manual compaction is for when you want to drop a section deliberately.

Sub-agents (Explore, Plan, custom ones) are the third lever: each runs in its own context window, returns a summary to the parent, and the parent sees only the summary. This is how you research a huge codebase without blowing the parent’s budget on file reads.

When this is asked in interviews#

The context window comes up often in AI-engineering interviews, sometimes very technically.

AI-product / coding-tool loops — “how do you handle a codebase that doesn’t fit in context?” is a common question. The answer covers selective reads, summarisation, sub-agents, retrieval, and the trade-offs between each. Knowing that Claude Code’s answer is “sub-agents and on-demand reading” makes you specific.
Cost-engineering interviews — “how do you keep an agentic system’s per-task cost predictable?” The answer is mostly context management: cap reads, compact aggressively, batch related work, avoid re-reading the same files. Showing the breakdown of where tokens go (system prompt vs tool results vs conversation) is a strong move.
Senior backend / infra loops at teams running internal agents — the question becomes “what’s your runaway-context defence?” The answer is hard limits, automatic compaction, alerting on per-task token counts, and circuit breakers when a single task crosses a threshold.

Less common in non-AI loops, but tokens-as-cost is increasingly a general engineering concept; being able to talk fluently about it is a soft advantage even outside AI roles.

The one habit that quietly halves long-session cost

Use sub-agents (Explore especially) for any broad-search question instead of doing the read in the main session. The Explore agent reads dozens of files in its own context, returns a paragraph, and the parent session pays for the paragraph rather than the reads. On a long session this is the difference between a $4 conversation and a$ 9 one for the same outcome.