Hierarchical Planning — Agentic · Engineering Playbook

What it is

Hierarchical planning is the pattern where a single agent task is split across two (or more) layers: a high-level planner that decomposes the goal into sub-goals, and a low-level executor that carries each sub-goal out. The planner thinks in milestones; the executor thinks in tool calls. Each layer has its own loop, its own context window, and often its own prompt.

The motivation is the long-horizon problem. A flat ReAct loop works well for short tasks of roughly 10 to 15 steps. Past that, the context window fills up with observation history, attention degrades, the model starts forgetting earlier state, and the agent drifts off-goal. Hierarchical planning addresses this by giving each sub-task a fresh context: the executor sees only the current sub-goal and the relevant local state, not the full trace of everything that came before. The planner sees only milestone-level summaries, not the executor’s tool calls.

In software engineering terms, this is the same trick that procedural abstraction gives you elsewhere: bound the working set, push details into a sub-routine, return only a summary. The agent version: bound the LLM context, push details into a sub-loop, return only a summary of what happened.

When to use it

Hierarchical planning is worth the added orchestration when:

The task is long-horizon. More than ~15 actions to complete. A research report (gather sources, read each, synthesise, write), a multi-page web flow (navigate, find item, configure, check out, confirm), a multi-file code refactor.
Sub-tasks are largely independent. “Read paper A” and “read paper B” don’t share state. “Find the right product” and “complete checkout” share little beyond the chosen item, which fits in a short summary. Independence is what lets each sub-task have a fresh context.
Different sub-tasks need different tools or prompts. A planner that thinks about strategy needs different framing than an executor that drives a browser. Specialisation per layer is easier than one prompt that does both.
The plan itself is non-trivial. If you can write the plan as a static template, do that instead. Hierarchical planning earns its keep when the plan is dynamic — depends on the input, on intermediate findings, on user goals.

Don’t use hierarchical planning when:

The task is short or well-structured. A 3-step task doesn’t need a planner; the executor is the agent. Flat ReAct is simpler and easier to debug.
Sub-tasks share heavy state. If every sub-task depends on what the previous one observed, hierarchy buys you little — the executor needs the same context every time. Use flat ReAct with a memory module instead.
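Latency is critical. Each layer adds round-trips. The planner call sits on the critical path before any executor can start, and every re-plan adds another round-trip; unless independent sub-goals run in parallel, the total is more latency than the equivalent flat loop.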

How it works

The two-layer canonical structure

[Planner]
Goal: Book a 3-day trip to Lisbon next month under $1500.
Sub-goals:
  1. Find flights LHR → LIS, return; under $400.
  2. Find a hotel in Alfama for 3 nights; under $600.
  3. Compose the itinerary and confirm with user.

[Executor on sub-goal 1]
ReAct loop with browser/search tools.
Returns: { flights_found: 2, recommended: AA1234, price: 380, ... }

[Executor on sub-goal 2]
Fresh context. Sees only sub-goal 2 and (optionally) a brief summary of sub-goal 1's outcome.
ReAct loop with browser/booking tools.
Returns: { hotel: Pousada Alfama, total: 540, ... }

[Executor on sub-goal 3]
Fresh context. Sees the summaries of 1 and 2.
ReAct loop with messaging/confirmation tools.

[Planner]
All sub-goals returned. Itinerary assembled.
Final answer to user.

Notice the data flow:

The planner sees goal + sub-goal summaries, not the executors’ steps.
Each executor sees its sub-goal + (small) context from prior sub-goals + its own steps, not the planner’s reasoning or other executors’ steps.
The planner is invoked at the start (initial plan) and at the end (assemble), and possibly in the middle (re-plan when an executor fails or returns something unexpected).

This is the architectural shape. Variants change what summary flows where and when the planner is re-invoked.
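The data flow above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `run_hierarchical`, `SubGoalResult`, and the stub planner and executor are hypothetical names, and in a real system both callables would wrap model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubGoalResult:
    sub_goal: str
    status: str   # "success" or "failure"
    summary: str  # the only part that flows up to the planner


# Hypothetical signatures; in a real system both would wrap model calls.
Planner = Callable[[str], list[str]]
Executor = Callable[[str, list[str]], SubGoalResult]


def run_hierarchical(goal: str, plan: Planner, execute: Executor) -> dict:
    """Run each sub-goal in a fresh executor context and keep only summaries."""
    results: list[SubGoalResult] = []
    for sub_goal in plan(goal):
        # The executor sees its sub-goal plus brief summaries of earlier ones,
        # never the planner's reasoning or other executors' steps.
        prior_summaries = [r.summary for r in results]
        results.append(execute(sub_goal, prior_summaries))
    # Planner-level view: goal plus per-sub-goal summaries, no step traces.
    return {
        "goal": goal,
        "summaries": [r.summary for r in results],
        "ok": all(r.status == "success" for r in results),
    }


# Stubs so the sketch runs without a model.
def stub_plan(goal: str) -> list[str]:
    return ["find flights", "find hotel", "compose itinerary"]


def stub_execute(sub_goal: str, prior: list[str]) -> SubGoalResult:
    trace = [f"tool_call {i} for {sub_goal}" for i in range(5)]  # stays local
    summary = f"{sub_goal}: done in {len(trace)} steps, given {len(prior)} prior summaries"
    return SubGoalResult(sub_goal, "success", summary)


outcome = run_hierarchical("Book a 3-day trip to Lisbon", stub_plan, stub_execute)
```

The executor's `trace` never leaves `stub_execute`; only the one-line summary reaches the planner-level result, which is the property the pattern depends on.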

The planner’s job

The planner does three things:

Decompose the goal into sub-goals. Each sub-goal is concrete enough that an executor can attempt it without further decomposition (or with one more level of decomposition, if you have a three-layer hierarchy).
Order the sub-goals. Sequential (B depends on A’s result), parallel (B and C are independent), or DAG-shaped (B and C both depend on A but not each other).
Decide when to re-plan. If an executor fails or returns something unexpected, the planner is re-invoked with the failure context to revise the remaining plan.

The planner’s output is typically structured — a list of sub-goal objects with names, descriptions, dependencies, and success criteria. This structure is what the orchestrator uses to dispatch executors and to track progress.
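One possible shape for that structured output, as a sketch: the `SubGoal` fields mirror the ones named above, while `dispatch_order` and the example plan are assumptions made for illustration. The ordering uses the standard-library `graphlib.TopologicalSorter`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class SubGoal:
    name: str
    description: str
    success_criteria: str
    depends_on: list[str] = field(default_factory=list)


def dispatch_order(plan: list[SubGoal]) -> list[str]:
    """Return sub-goal names in an order that respects their dependencies."""
    known = {sg.name for sg in plan}
    for sg in plan:
        missing = set(sg.depends_on) - known
        if missing:
            raise ValueError(f"{sg.name} depends on unknown sub-goals: {sorted(missing)}")
    # Map each sub-goal to its predecessors, as TopologicalSorter expects.
    graph = {sg.name: set(sg.depends_on) for sg in plan}
    # static_order() raises graphlib.CycleError if the plan is not a DAG.
    return list(TopologicalSorter(graph).static_order())


plan = [
    SubGoal("itinerary", "Compose the itinerary and confirm with user.",
            "user confirmed", depends_on=["flights", "hotel"]),
    SubGoal("flights", "Find flights LHR -> LIS, return.", "price under $400"),
    SubGoal("hotel", "Find a hotel in Alfama for 3 nights.", "total under $600"),
]
order = dispatch_order(plan)
```

Validating dependencies before dispatch catches a planner that references a sub-goal it never emitted, which is cheaper to detect here than halfway through execution.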

The executor’s job

The executor receives a single sub-goal and runs its own loop — usually ReAct — to satisfy it. The executor is closer to a vanilla agent: it has tools, it interacts with the environment, it produces an observation log. The key difference from a flat agent is that the executor’s success criteria are passed in from the planner, not inferred from the user’s original prompt.

When the executor finishes, it returns a structured result to the planner: { status: success/failure, summary: ..., key_findings: ..., artifacts: ... }. The summary is what flows up; the full trace stays in the executor’s logs.
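A minimal sketch of that return contract follows. The four fields come from the structure described above; `run_executor`, `trace_store`, and the stand-in tool steps are hypothetical, and a real executor would run a model-driven loop instead of iterating over a fixed list.

```python
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class ExecutorResult:
    status: str                                   # "success" or "failure"
    summary: str
    key_findings: dict[str, Any] = field(default_factory=dict)
    artifacts: list[str] = field(default_factory=list)


def run_executor(sub_goal: str, success_criteria: str, tool_steps: list[str],
                 trace_store: dict[str, list[str]]) -> ExecutorResult:
    """Run a stand-in loop, log the full trace locally, return only a summary."""
    trace = []
    for step in tool_steps:
        # Stand-in for a real tool call and its observation.
        trace.append(f"action={step}")
    # The full trace stays in the executor's logs, keyed by sub-goal.
    trace_store[sub_goal] = trace
    return ExecutorResult(
        status="success",
        summary=f"Completed '{sub_goal}' against criteria: {success_criteria}",
        key_findings={"steps_taken": len(trace)},
    )


logs: dict[str, list[str]] = {}
result = run_executor("find flights", "price under $400",
                      ["search", "filter", "select"], logs)
planner_view = asdict(result)  # what the planner actually receives
```

Note that the success criteria are an argument supplied by the planner, not something the executor infers from the user's original prompt.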

Context budget — the central design question

The whole point of hierarchical planning is to bound each layer’s context. The design question is what summary flows between layers.

Three options, each with characteristic problems:

Pass almost nothing up. The planner sees only { status: success } and a free-text summary the executor wrote. Cheapest, smallest context. Risk: the planner makes a downstream decision that depends on detail the executor omitted from the summary.
Pass everything up. The planner sees the full executor trace. Defeats the purpose — the planner’s context fills up with executor steps, and you’re back to flat-loop pathology.
Pass structured key findings. The executor returns a typed object with the specific fields the planner cares about ({ recommended_flight: ..., price: ..., alternatives: [...] }). Costs slightly more than free-text summaries but gives the planner reliable affordances.

The third option is the right default. The schema for what each executor returns is part of the system design, not the agent’s prompt.
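As a sketch of what "schema as system design" can look like, the snippet below validates an executor's raw output against a fixed type before it reaches the planner. The field names follow the flight example above; `FlightFindings` and `parse_flight_findings` are hypothetical, and dropping unknown keys is one possible policy rather than the only reasonable one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FlightFindings:
    recommended_flight: str
    price: float
    alternatives: list[str] = field(default_factory=list)


def parse_flight_findings(raw: dict) -> FlightFindings:
    """Validate raw executor output against the agreed schema.

    Unknown keys (for example a leaked step trace) are dropped so they
    cannot grow the planner's context.
    """
    missing = {"recommended_flight", "price"} - raw.keys()
    if missing:
        raise ValueError(f"executor result is missing fields: {sorted(missing)}")
    return FlightFindings(
        recommended_flight=str(raw["recommended_flight"]),
        price=float(raw["price"]),
        alternatives=[str(a) for a in raw.get("alternatives", [])],
    )


findings = parse_flight_findings({
    "recommended_flight": "AA1234",
    "price": 380,
    "trace": ["opened site", "clicked search"],  # not part of the schema
})
```

Because the check lives in the orchestrator, a missing field shows up as an explicit error at the layer boundary rather than as a planner decision made on incomplete information.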

Depth — when to go three layers, four layers

Most production hierarchical agents have exactly two layers — planner and executor. Three layers (super-planner, sub-planner, executor) is reasonable for very long-horizon tasks (a multi-week research project, a months-long agentic workflow). Four or more is almost always a sign that the abstraction at each level is too thin and could be collapsed.

The right depth is the depth at which each layer has a coherent vocabulary. A planner that thinks in “find the right hotel” and an executor that thinks in “click the search button” are clearly different abstractions. A “sub-planner” that thinks in “open the booking site” is in between and probably belongs in either the planner (as a sub-goal) or the executor (as the first step).

Re-planning on failure

The classic failure mode of static plans: the world doesn’t cooperate, an executor fails, and the rest of the plan is now stale. Hierarchical agents handle this with a re-planning step:

Executor reports failure. Returns { status: failure, reason: ..., context: ... }.
Planner is re-invoked with the failure. Sees the original goal, the remaining sub-goals, and the failure detail.
Planner emits a revised plan. May skip the failed sub-goal, substitute an alternative, or abandon downstream sub-goals that depended on it.

Re-planning is the place where hierarchical agents earn their flexibility relative to static workflows. A static workflow typically stops on the first failure, or at best follows a fallback branch someone wrote in advance. A hierarchical agent can route around it.
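The three re-planning steps can be sketched as a loop. Everything here is illustrative: `run_with_replanning`, the `max_replans` cap, and the stub executor and planner are assumptions, with the stubs standing in for model calls.

```python
from typing import Callable


def run_with_replanning(goal: str, plan: list[str],
                        execute: Callable[[str], dict],
                        replan: Callable[[str, list[str], dict], list[str]],
                        max_replans: int = 2) -> dict:
    """Execute sub-goals in order; on failure, ask the planner to revise the rest."""
    remaining = list(plan)
    completed: list[str] = []
    replans = 0
    while remaining:
        sub_goal = remaining.pop(0)
        result = execute(sub_goal)
        if result["status"] == "success":
            completed.append(sub_goal)
            continue
        if replans >= max_replans:
            return {"status": "failure", "completed": completed, "failed_on": sub_goal}
        replans += 1
        # The planner sees the original goal, the remaining sub-goals,
        # and the failure detail, then emits a revised remainder.
        failure = {"sub_goal": sub_goal, "reason": result["reason"]}
        remaining = replan(goal, remaining, failure)
    return {"status": "success", "completed": completed, "replans": replans}


# Stubs: the Alfama hotel is sold out, so the planner substitutes another area.
def stub_execute(sub_goal: str) -> dict:
    if sub_goal == "book hotel in Alfama":
        return {"status": "failure", "reason": "sold out"}
    return {"status": "success"}


def stub_replan(goal: str, remaining: list[str], failure: dict) -> list[str]:
    return ["book hotel in Baixa"] + remaining


outcome = run_with_replanning(
    "Book a 3-day trip to Lisbon",
    ["find flights", "book hotel in Alfama", "compose itinerary"],
    stub_execute, stub_replan,
)
```

The cap on re-plans is a design choice worth keeping in any version of this loop: without it, a planner that keeps proposing failing alternatives never terminates.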

Coordination between sub-tasks

When sub-tasks have dependencies, the planner is in charge of the DAG. The orchestrator should respect the DAG: dispatch independent sub-tasks in parallel, wait on dependencies, fail fast on critical-path failures.
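A sketch of an orchestrator that respects the DAG, using the standard-library `graphlib.TopologicalSorter` and `concurrent.futures`. The names `run_dag`, `TRIP_DAG`, and `stub_execute` are hypothetical, and for simplicity this version treats every failure as critical; a fuller version would distinguish critical-path sub-goals from optional ones.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
from typing import Callable


def run_dag(dependencies: dict[str, set[str]],
            execute: Callable[[str], dict],
            max_workers: int = 4) -> dict[str, dict]:
    """Dispatch ready sub-goals in parallel, wait on dependencies, fail fast."""
    sorter = TopologicalSorter(dependencies)  # node -> set of predecessors
    sorter.prepare()
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        in_flight = {}
        while sorter.is_active():
            # Everything whose dependencies are done can start now.
            for name in sorter.get_ready():
                in_flight[pool.submit(execute, name)] = name
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                name = in_flight.pop(future)
                results[name] = future.result()
                if results[name]["status"] != "success":
                    # Fail fast: cancel what has not started and stop dispatching.
                    for pending in in_flight:
                        pending.cancel()
                    return results
                sorter.done(name)  # unblocks dependants
    return results


TRIP_DAG = {"flights": set(), "hotel": set(), "itinerary": {"flights", "hotel"}}
started: list[str] = []


def stub_execute(name: str) -> dict:
    started.append(name)  # list.append is thread-safe in CPython
    return {"status": "success", "summary": f"{name} done"}


results = run_dag(TRIP_DAG, stub_execute)
```

Here `flights` and `hotel` are submitted together, and `itinerary` is only submitted once both have been marked done.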

When sub-tasks are nominally independent but turn out to share state (the executor for sub-goal 2 needed something from sub-goal 1 that wasn’t in the summary), one of three things happens:

The schema is wrong. Expand what the executor returns so the missing field is included.
The plan is wrong. The dependency should have been explicit. Re-plan.
The pattern is wrong. If sub-tasks share too much state, hierarchical decomposition is fighting the problem. Flatten back out.

Variants

Planner-Executor (classic). Two layers, planner runs once or twice, executors run per sub-goal. The default.
Iterative planning (re-plan after every sub-goal). The planner runs after each executor returns, deciding what to do next based on the latest result. More flexible, more expensive — and harder to debug because the plan is never “stable”.
Tree-of-thoughts with selection. Planner generates multiple candidate plans; a critic ranks them; the best is dispatched. Useful when plan quality, rather than execution, is the dominant bottleneck.
Manager-worker (multi-agent variant). Planner is a literal supervisor agent; executors are worker agents with their own personas and tool surfaces. The boundary between hierarchical planning and multi-agent orchestration blurs here — see Multi-Agent Orchestration.
Plan-first / interleaved. Plan-first generates the whole plan up front; interleaved generates one sub-goal at a time, dispatches, observes, plans the next. Plan-first is more predictable; interleaved adapts faster.
HTN-style (hierarchical task network). A formalism from classical planning. Each task is either primitive (can be executed directly) or compound (decomposes into sub-tasks via a method). Modern LLM agents rediscover this shape — the model is the implicit method library.
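To make the HTN shape concrete, here is a minimal sketch of primitive versus compound tasks with an explicit method library. The `METHODS` table and `decompose` function are hypothetical; in an LLM agent the model plays the role of the table, proposing the sub-tasks for each compound task on demand.

```python
# Hypothetical method library: each compound task maps to ordered sub-tasks.
# Any task without an entry is treated as primitive (directly executable).
METHODS: dict[str, list[str]] = {
    "book trip": ["find flights", "find hotel", "compose itinerary"],
    "find flights": ["search flights", "pick cheapest flight"],
}


def decompose(task: str, methods: dict[str, list[str]], max_depth: int = 10) -> list[str]:
    """Expand a task depth-first into an ordered list of primitive tasks."""
    if task not in methods:
        return [task]  # primitive: hand it to an executor as-is
    if max_depth == 0:
        # Guards against a method library that decomposes a task into itself.
        raise ValueError(f"decomposition of {task!r} exceeded the depth limit")
    primitives: list[str] = []
    for sub_task in methods[task]:
        primitives.extend(decompose(sub_task, methods, max_depth - 1))
    return primitives


primitive_plan = decompose("book trip", METHODS)
```

The depth limit matters more when a model supplies the methods, since nothing then guarantees that decomposition bottoms out.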

Why the planner usually isn't 'just a fine-tuned planner model'

An attractive idea: train a specialised planner model that’s much better at decomposition than the generalist executor model. In practice, two reasons it rarely pays off in production:

The planner’s job is goal-conditioned, and the goal space is open-ended. A planner trained on tasks like yours is brittle when the task shifts.
The planner only runs once or twice per top-level task. Its cost is small relative to executor compute. Optimising it specifically buys little overall.

What does work: a strong general-purpose model as the planner, with carefully tuned few-shot examples in its prompt. The structure carries more weight than the model identity.

Example systems

WebVoyager — for long tasks, an outer planner sets sub-goals (“find the cheapest flight”, “complete the booking”, “confirm with user”) and the inner ReAct loop executes each. Bounded context per sub-goal keeps the loop tractable on tasks that would otherwise overflow.
MuLan — the LLM-orchestrated diffusion pipeline is hierarchical: a top-level planner decomposes the prompt into per-object plans, and a per-object diffusion executor renders each. The hierarchy maps to the visual structure of the scene.
ChainBuddy — pipeline generation runs as a multi-agent hierarchy: requirement-gathering chat at the top, pipeline-component generators below, each producing a piece of the final workflow.
Long-horizon coding agents. Modern coding agents that tackle multi-file refactors run hierarchically: the top level plans the refactor (which files, in what order, with what test gates), each sub-task is an edit-and-verify loop on one file.

Trade-offs

Hierarchical planning. Strengths: handles long-horizon tasks; bounded context per layer; specialised prompts per layer; supports re-planning on failure. Costs: higher orchestration complexity; harder to debug across layers; the planner adds a critical-path round trip.

Flat ReAct loop. Strengths: simplest to build; single trace to debug; lowest orchestration overhead. Costs: degrades past ~15 steps; context fills with observation history; no natural place for specialisation.

Other axes:

Planning depth vs adaptability. Deeper plans (more upfront decomposition) are more predictable but less adaptive. Shallow plans (one sub-goal at a time) are more adaptive but harder to budget. Match depth to how stable your environment is.
Strict hierarchy vs hybrid. Some systems mix flat-loop sections with hierarchical sections — flat for short interactive turns, hierarchical for long-running background tasks. The architecture isn’t all-or-nothing.
Planner cost vs executor cost. The planner runs few times; executors run many times. Use a strong (and expensive) model for the planner if it improves plan quality; use a cheaper model for executors if they’re doing routine tool dispatch.