MuLan — Multimodal Diffusion Agent — Agentic

Context#

Diffusion models are surprisingly bad at counting and at spatial composition. A prompt like “three red apples on a wooden table, with a blue mug to the left and an open book to the right” tends to produce two or four apples, drop the mug, or stack everything in a confused pile. The pre-trained generative prior captures objects beautifully and arrangements badly, because a single generation pass has no explicit reasoning step.

The community spent years patching this with attention-control techniques, layout-conditioned diffusion, and region-specific guidance. These work to a point but require the user to specify boxes, or break down when objects interact non-trivially.

MuLan (“Multimodal-LLM Agent for Progressive Multi-Object Diffusion”), published in 2024, reframes the problem as agentic orchestration. Instead of trying to coax a single diffusion pass into the right composition, MuLan uses an LLM to plan the generation, generates objects progressively (one at a time, conditioned on the prior canvas), and uses a vision-language model (VLM) to verify each step and trigger re-planning if needed. The “agent” isn’t the diffusion model; it’s the planner and verifier that orchestrate many diffusion calls.

The interesting bit is that the architecture (plan, generate, verify, re-plan) mirrors that of a coding agent or a web agent. The same control flow that runs a ReAct loop over web actions runs over diffusion calls in MuLan. Generation problems can be agent problems.

Problem#

The concrete problems MuLan targets:

Multi-object prompts. Prompts with three, five, eight distinct objects in specific spatial relationships. Stock diffusion models drop objects, merge them, or get the counts wrong.
Compositional control without manual layout. Layout-conditioned models work but require the user to draw boxes. MuLan’s promise is “give me your text prompt; I’ll handle the composition.”
Verification without a human in the loop. The classical fix is to generate many candidates and let the user pick. MuLan replaces the user with a VLM critic that checks whether each generated image matches the plan and triggers regeneration if not.
Order of generation matters. Generating the background first, then the objects in foreground-to-background order, produces different results from the reverse order. MuLan plans the order, not just the content.

Architecture#

MuLan is a three-role loop: an LLM planner and a VLM critic wrapped around a stock diffusion generator.

   ┌──────────────────────────────────────────────────────────────┐
   │                       User Prompt                             │
   │   "Three red apples on a wooden table, blue mug left,        │
   │    open book right"                                           │
   └──────────────────────────┬──────────────────────────────────┘
                              │
                              ▼
   ┌──────────────────────────────────────────────────────────────┐
   │                    LLM Planner Agent                          │
   │   Decomposes prompt into ordered sub-prompts + masks/regions │
   │                                                               │
   │   Step 1: wooden table (background)                          │
   │   Step 2: red apple (count=3) on table                       │
   │   Step 3: blue mug, left of apples                           │
   │   Step 4: open book, right of apples                         │
   └──────────────────────────┬──────────────────────────────────┘
                              │
                              ▼
   ┌──────────────────────────────────────────────────────────────┐
   │             Progressive Generation (loop)                     │
   │                                                               │
   │   for each step in plan:                                     │
   │     ┌────────────────────────────────────────────────────┐    │
   │     │ Diffusion call (conditioned on canvas + sub-prompt)│   │
   │     └────────────────────────────────────────────────────┘    │
   │                              │                                │
   │                              ▼                                │
   │     ┌────────────────────────────────────────────────────┐    │
   │     │   VLM Critic                                       │    │
   │     │   "Did this step succeed? What's wrong?"           │    │
   │     └─────────┬──────────────────────────┬───────────────┘    │
   │       (yes)  │                          │ (no)               │
   │              │                          ▼                    │
   │              │              ┌────────────────────────┐       │
   │              │              │ Planner re-plan / retry│       │
   │              │              └────────────────────────┘       │
   │              ▼                                                │
   │       commit step, advance                                    │
   └──────────────────────────┬──────────────────────────────────┘
                              │
                              ▼
   ┌──────────────────────────────────────────────────────────────┐
   │                       Final Image                             │
   └──────────────────────────────────────────────────────────────┘
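
The control flow in the diagram can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `Step`, `Verdict`, `run_plan`, and the three callables are hypothetical names, the canvas is modelled as a list of committed sub-prompts rather than an image, and the attempt budget of three is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    sub_prompt: str                     # what to add at this step
    region: Tuple[int, int, int, int]   # (x0, y0, x1, y1) on the canvas


@dataclass
class Verdict:
    passed: bool
    explanation: str = ""


# Stand-in for an image: the list of sub-prompts committed so far (assumption).
Canvas = List[str]


def run_plan(
    plan: List[Step],
    generate: Callable[[Canvas, Step], Canvas],
    critique: Callable[[Canvas, Step], Verdict],
    replan: Callable[[Step, Verdict], Optional[Step]],
    max_attempts: int = 3,
) -> Canvas:
    """Plan -> generate -> verify -> re-plan, one step at a time."""
    canvas: Canvas = []
    for step in plan:
        current: Optional[Step] = step
        for _ in range(max_attempts):
            if current is None:
                break  # the planner chose to skip this step
            candidate = generate(canvas, current)
            verdict = critique(candidate, current)
            if verdict.passed:
                canvas = candidate  # commit the step and advance
                break
            # On failure the candidate is discarded and the planner
            # decides how to adjust the step (or returns None to skip).
            current = replan(current, verdict)
        # If every attempt fails, the step is dropped and the canvas is unchanged.
    return canvas
```

Passing the generator, critic, and planner in as plain callables keeps the loop independent of any particular model, which also makes it easy to exercise with stubs.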

The three roles#

Planner (LLM). Reads the prompt, decomposes it into an ordered list of generation steps. Each step has a sub-prompt (what to add), a region (where on the canvas), and dependencies (depends on which prior steps). The planner is what gives MuLan its “this is an agent” character — it’s where reasoning happens.
Generator (Diffusion). A stock text-to-image diffusion model, called once per step with the current canvas as image conditioning plus the step’s sub-prompt. The trick is regional conditioning: the generator’s attention is biased toward the planned region for the new object, leaving the rest of the canvas alone. The diffusion model isn’t modified; the conditioning is.
Critic (VLM). A vision-language model that examines the canvas after each step and answers a fixed-form question — "Does this image contain <object>? At <location>? With <attributes>?" — with placeholder slots filled from the plan. The critic produces a structured pass/fail and, on fail, a short explanation. The planner reads the explanation and decides — retry the step with adjusted parameters, replace the step with a different sub-prompt, or skip the step.
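
To make the planner's output and the critic's fixed-form question concrete, here is a minimal sketch of a typed plan step and the question template filled from it. All names (`PlanStep`, `describe_location`, `critic_question`, `is_valid_order`) are hypothetical, and the mapping from a region to a phrase like "the left third of the canvas" is an assumption made for illustration; the text above does not specify how locations are worded.

```python
from dataclasses import dataclass
from typing import List, Tuple

Region = Tuple[float, float, float, float]  # normalised (x0, y0, x1, y1)


@dataclass(frozen=True)
class PlanStep:
    index: int
    sub_prompt: str                      # what to add
    region: Region                       # where on the canvas
    attributes: Tuple[str, ...] = ()     # e.g. colour, material
    depends_on: Tuple[int, ...] = ()     # indices of prior steps


def describe_location(region: Region) -> str:
    """Turn a box into a coarse location phrase (illustrative rule)."""
    centre_x = (region[0] + region[2]) / 2
    if centre_x < 1 / 3:
        return "the left third of the canvas"
    if centre_x > 2 / 3:
        return "the right third of the canvas"
    return "the centre of the canvas"


def critic_question(step: PlanStep) -> str:
    """Fill the fixed-form critic question with slots from the plan."""
    attrs = ", ".join(step.attributes) or "no particular attributes"
    return (
        f"Does this image contain {step.sub_prompt}? "
        f"At {describe_location(step.region)}? With {attrs}?"
    )


def is_valid_order(plan: List[PlanStep]) -> bool:
    """Every dependency must point at a step that comes earlier in the plan."""
    seen = set()
    for step in plan:
        if any(dep not in seen for dep in step.depends_on):
            return False
        seen.add(step.index)
    return True
```

Because the plan is plain data, it can be validated, logged, or edited before any diffusion call is made.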

Regional control mechanism#

The key technical lever MuLan uses to control where objects appear is attention-mask conditioning. The diffusion model’s cross-attention layers are masked so that latent positions inside the planned region attend preferentially to the step’s sub-prompt tokens. This is region-specific guidance applied per step, without retraining the diffusion model; the masks come from the planner’s layout, not from a human.
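
One simple way to picture this kind of masking is an additive bias on the cross-attention logits: positive for the object's tokens inside the planned box, negative outside it. The sketch below is an illustration of that general idea in pure Python, not MuLan's actual mechanism, which may differ in detail; `region_mask`, `biased_attention`, and the bias value are all assumptions.

```python
import math
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float]  # normalised (x0, y0, x1, y1)


def region_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """1 where the centre of a latent cell falls inside the box, else 0."""
    x0, y0, x1, y1 = box
    return [
        [
            1 if x0 <= (c + 0.5) / width < x1 and y0 <= (r + 0.5) / height < y1 else 0
            for c in range(width)
        ]
        for r in range(height)
    ]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(
    logits: List[List[float]],   # logits[p][t]: latent position p -> prompt token t
    mask: List[List[int]],
    object_tokens: Set[int],     # indices of the step's sub-prompt tokens
    bias: float = 4.0,
) -> List[List[float]]:
    """Raise object-token logits inside the region and lower them outside."""
    flat = [m for row in mask for m in row]  # row-major, same order as logits
    weights = []
    for pos, row in enumerate(logits):
        sign = 1.0 if flat[pos] else -1.0
        shifted = [v + sign * bias if t in object_tokens else v for t, v in enumerate(row)]
        weights.append(softmax(shifted))
    return weights
```

With uniform logits, positions inside the box end up putting most of their attention on the object's tokens while positions outside put almost none, which is the preferential attention described above.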

A second lever is canvas conditioning — the prior step’s canvas is fed back as image conditioning (via image-to-image or inpainting interfaces), so the diffusion model sees what’s already there and doesn’t overwrite it.
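
The "doesn't overwrite it" property can be illustrated with a mask-based blend: values inside the step's region come from the new generation, everything else is copied from the prior canvas. Many inpainting pipelines apply a blend of this kind in latent space during denoising; the sketch below applies it once to toy 2D grids, and `composite` is a hypothetical helper rather than part of any library.

```python
from typing import List

Grid = List[List[int]]


def composite(canvas: Grid, generated: Grid, mask: Grid) -> Grid:
    """Take `generated` where mask is 1 and keep `canvas` where mask is 0."""
    return [
        [g if m else c for c, g, m in zip(canvas_row, gen_row, mask_row)]
        for canvas_row, gen_row, mask_row in zip(canvas, generated, mask)
    ]


def untouched_outside(canvas: Grid, result: Grid, mask: Grid) -> bool:
    """Check that nothing outside the masked region was changed."""
    return all(
        c == r
        for canvas_row, result_row, mask_row in zip(canvas, result, mask)
        for c, r, m in zip(canvas_row, result_row, mask_row)
        if not m
    )
```

A check like `untouched_outside` is cheap to run after every step and catches a generator that has drifted outside its assigned region.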

Key innovations#

What makes MuLan more than “diffusion plus a planner”:

Progressive generation as the primary mechanic. One object at a time, with explicit checks between steps. The earlier approaches described above all tried to fix a single pass; the agentic reframe is what unlocks compositional fidelity.
VLM as in-the-loop critic. Most diffusion pipelines either generate many candidates and let the user pick, or score the final image after the fact. MuLan checks after each step and intervenes immediately. This is the same closed-loop pattern as a coding agent’s compiler-feedback loop — verifier in the inner loop, not in the outer one.
Plan as structured intermediate state. The planner emits a typed plan (object, region, attributes, order) — not a free-form chain of instructions. The plan is auditable, editable, and replayable. If the user disagrees with the planner’s decomposition, they can override it directly.
No fine-tuning anywhere. Planner is a general LLM, critic is a general VLM, generator is a stock diffusion model. The capability comes from composing three off-the-shelf components with a carefully designed protocol — a hallmark of strong agentic systems.
Retry policy as part of the planner’s job. When the critic flags a step, the planner doesn’t just say “try again”; it can adjust the sub-prompt, swap a different object instance, change the region, or skip. The retry behaviour is itself reasoned over, not hardcoded.
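
Since the retry decision is reasoned over by the planner rather than hardcoded, the surrounding code mostly needs to validate and apply whatever the planner returns. The sketch below assumes the planner replies with a small JSON object such as `{"action": "move", "region": [...]}`; that schema, the four action names, and `apply_decision` are hypothetical and chosen to match the options listed above.

```python
import json
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Step:
    sub_prompt: str
    region: Tuple[float, float, float, float]  # normalised (x0, y0, x1, y1)


ALLOWED_ACTIONS = {"retry", "rewrite", "move", "skip"}


def apply_decision(step: Step, decision_json: str) -> Optional[Step]:
    """Apply the planner's decision; return the next step to try, or None to skip."""
    decision = json.loads(decision_json)
    action = decision.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown planner action: {action!r}")
    if action == "retry":
        return step                                    # same step, new sample
    if action == "rewrite":
        return replace(step, sub_prompt=decision["sub_prompt"])
    if action == "move":
        return replace(step, region=tuple(decision["region"]))
    return None                                        # "skip"
```

Rejecting unknown actions keeps a malformed planner reply from silently turning into a retry.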

Evaluation#

MuLan was evaluated against single-pass diffusion baselines and layout-conditioned baselines on multi-object prompt benchmarks:

Object presence and count accuracy. Did the generated image contain all the prompted objects, in the prompted counts? MuLan reportedly improved markedly over single-pass baselines — fewer dropped objects, fewer miscounts.
Spatial-relationship accuracy. Did “left of”, “right of”, “above”, “below” relationships hold? MuLan won here too, since the region-aware planner has explicit spatial reasoning.
Image quality metrics. FID and CLIP-similarity were comparable to baselines or slightly worse — progressive generation can leave artefacts at object boundaries that single-pass diffusion avoids. The trade-off is fidelity in composition vs. seamless single-pass aesthetics.
Human preference. In paired comparisons, raters preferred MuLan outputs for prompts with three or more objects; baselines were preferred for one-object prompts where composition isn’t the bottleneck.

The evaluation uses both VLM-based automatic metrics (CLIP score, GPT-4V judgement) and human raters, with the human signal as the ground truth.
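
The first two criteria are easy to state as code once object detections are available. The sketch below is illustrative only: the benchmark's actual scoring procedure is not described here, and `count_accuracy`, `relation_holds`, and the use of box centres to decide spatial relations are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # normalised (x0, y0, x1, y1)


def count_accuracy(expected: Dict[str, int], detected: List[str]) -> float:
    """Fraction of prompted object types that appear in exactly the prompted count."""
    found = Counter(detected)
    hits = sum(1 for label, n in expected.items() if found[label] == n)
    return hits / len(expected)


def relation_holds(relation: str, a: Box, b: Box) -> bool:
    """Check a spatial relation between a and b using box centres."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by   # image coordinates: y grows downward
    if relation == "below":
        return ay > by
    raise ValueError(f"unsupported relation: {relation!r}")
```

For the running example, an image with two apples, one mug, and one book would score two out of three on count accuracy, because only the apple count is wrong.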

Trade-offs and limitations#

Where MuLan shines. Multi-object prompts with countable objects and explicit spatial relationships. Prompts where composition fidelity matters more than seamless aesthetics. Workflows where the user can wait tens of seconds for the multi-step generation.

Where it doesn’t help. Single-object prompts. Abstract or non-compositional prompts (“a dreamy mood, soft lighting”). Real-time generation where the additional latency of plan + critic loops is unacceptable. Tasks where the critic VLM can’t reliably distinguish success from failure.

Other limitations:

Boundary artefacts. Pasting objects onto an existing canvas via image conditioning leaves seams. Diffusion’s strength is global coherence; progressive generation fights that strength. MuLan partially mitigates with refinement passes but the artefact pattern is recognisable.
Latency stacks. A 5-step plan with two retries somewhere is 7+ diffusion calls plus a VLM call after each, plus several LLM planner calls. A single-pass diffusion is a few seconds; MuLan is 30+ seconds for a complex prompt. Worth it when composition matters; expensive otherwise.
Critic-as-ground-truth limit. If the VLM critic can’t tell when a step succeeded — for fine-grained attributes, unusual objects, or styles outside its training — the loop drives the wrong direction. The system inherits the critic’s blind spots.
Planner mis-decomposition. Some prompts don’t decompose cleanly. “A reflection of a vase in a mirror” doesn’t have a natural step order; pulling it apart makes things worse, not better. The planner needs a “this doesn’t decompose; punt to single-pass” escape hatch.
Region specification is approximate. Bounding boxes the planner emits are coarse; final object positions can drift within them. The system doesn’t attempt pixel-level control.
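
The latency arithmetic above can be written out as a small estimator. The per-call timings in the usage note are placeholder assumptions chosen only to show how the total lands in the 30+ second range; they are not measurements, and `estimate_latency` is a hypothetical helper.

```python
def estimate_latency(
    steps: int,
    retries: int,
    diffusion_s: float,
    critic_s: float,
    planner_s: float,
) -> float:
    """Rough end-to-end latency in seconds for one progressive generation.

    Assumes one critic call per diffusion call, and one planner call for the
    initial plan plus one per retry.
    """
    diffusion_calls = steps + retries
    critic_calls = diffusion_calls
    planner_calls = 1 + retries
    return (
        diffusion_calls * diffusion_s
        + critic_calls * critic_s
        + planner_calls * planner_s
    )
```

For a 5-step plan with two retries, assuming 3 s per diffusion call, 1.5 s per critic call, and 2 s per planner call, `estimate_latency(5, 2, 3.0, 1.5, 2.0)` gives 7 × 3 + 7 × 1.5 + 3 × 2 = 37.5 seconds.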

Lessons#

Generalisable takeaways from MuLan:

Generation tasks can be agent tasks. The plan-generate-verify loop isn’t unique to text or code. Any generation domain where (a) the output is composable, (b) intermediate states are inspectable, and (c) a critic can score sub-outputs is a candidate for agentic orchestration.
Verifier in the inner loop beats verifier at the end. Catching a missed object after one diffusion call costs you one redo. Catching it after the whole image costs you a full regeneration. Push verification as close to each generation step as you can.
Use off-the-shelf models with strong protocols. MuLan’s three components are all stock. The originality is in the composition. This is the right place for most teams to start — designing the protocol first, training models only if the protocol’s ceiling is too low.
Surface the plan as structured state. Letting users see and edit the decomposition is alignment for free. Hidden chain-of-thought is hard to debug; an explicit plan is the artefact you can iterate on.
Latency for fidelity is a fair trade if you let the user choose. Many users will accept 30 seconds for a faithful multi-object image; others want 3 seconds for a rough sketch. Build both paths, gate them on prompt complexity or an explicit toggle.
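
The last takeaway, together with the escape hatch for prompts that don't decompose, amounts to a small routing function in front of the two generation paths. This is a minimal sketch: `choose_path` and its parameters are hypothetical, and the default threshold of three objects is an assumption borrowed from the human-preference result reported in the evaluation.

```python
from typing import Optional


def choose_path(
    object_count: int,
    decomposable: bool = True,
    force: Optional[str] = None,
    threshold: int = 3,
) -> str:
    """Pick 'single_pass' or 'progressive' for a prompt."""
    if force in ("single_pass", "progressive"):
        return force                  # an explicit user toggle wins
    if not decomposable:
        return "single_pass"          # escape hatch: don't pull the prompt apart
    if object_count >= threshold:
        return "progressive"          # composition is likely the bottleneck
    return "single_pass"
```

In practice `object_count` and `decomposable` would come from the planner's first pass over the prompt, so the routing decision costs one LLM call rather than a full progressive run.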

Why not just train a bigger diffusion model that can count?

This is what every major lab tries, and it helps: newer models do count better than older ones. But (a) retraining a diffusion model is expensive and gives you no way to intervene at inference time, (b) any new compositional failure mode requires another round of training data and another training run, and (c) the system has no observable reasoning trace, so when it gets a count wrong you can’t tell why. MuLan’s agentic approach makes the reasoning explicit (the plan), keeps the diffusion model stock, and turns “what went wrong” into something you can step-debug. Training a bigger model complements the agentic approach but doesn’t replace it.