MACRS — Multi-Agent Conversational Recommender — Agentic

Context#

Conversational recommendation is the unloved sibling of search and feed-ranking. A user shows up wanting a movie, a restaurant, a book, a flight — and instead of typing a query and scanning a list, they want to talk to the system: ask follow-up questions, narrow the slate, reject options, change their mind. The recommender, in turn, has to ask the right clarifying questions, surface the right candidates, and know when to stop talking and actually recommend something.

The classical answer is a single retrieval-augmented chat model with a prompt that says “be a good recommender.” That works for the first turn and falls apart by turn five. The system can’t tell whether it’s gathering preferences, narrowing candidates, justifying a pick, or recovering from a rejection — they all look like “the next chat reply” to a single model.

MACRS — Multi-Agent Conversational Recommender System, published in 2024 — re-frames the problem around the conversational acts a good recommender performs. Instead of one model answering everything, MACRS uses a small team of specialised agents that plan which act to perform next, generate the act’s content, and reflect on the user’s response before the next turn. The architecture is the point: the team-of-agents structure encodes domain knowledge about how conversational recommendation actually works.

Problem#

The concrete problems MACRS attempts to solve:

Goal-directed dialogue. A recommender conversation has a goal (deliver a recommendation the user accepts) and a budget (the user’s patience, measured in turns). A naive chat model optimises for “good next reply” rather than “shortest path to a successful recommendation.”
Preference acquisition without interrogation. The system needs the user’s preferences but can’t extract them by asking “what genre? what year? what mood? what language?” in sequence — that’s a form, not a conversation. It has to weave preference-probing into natural exchanges.
Candidate-set management. A real catalogue has millions of items. The agent has to maintain a working set of candidates that shrinks with each turn, supports justification (“you said you like X, so consider Y”), and recovers when the user rejects a candidate.
Recovery from negative feedback. When the user says “not that, something more upbeat,” the system needs to update its model of the user’s preferences and its candidate set. Most chat-recommenders silently ignore the negative signal and recommend a near-duplicate.
Knowing when to stop probing. Each clarifying question costs a turn. The system has to decide, every turn, whether one more question is worth the cost or whether it should make the recommendation now.

The framing — multiple coordinating agents, each responsible for a different concern — is borrowed from the broader trend in agentic systems toward specialisation when one model can’t hold the whole job.

Architecture#

MACRS is structured as four coordinating agents, plus a shared dialogue state and candidate set.

                 ┌─────────────────────────────┐
                 │      Dialogue State         │
                 │  (user prefs, history,      │
                 │   candidate set, turn count)│
                 └─────────────┬───────────────┘
                               │  read / write
        ┌──────────┬───────────┼───────────┬──────────┐
        ▼          ▼           ▼           ▼          ▼
   ┌────────┐ ┌────────┐  ┌────────┐  ┌────────┐ ┌────────┐
   │ Plan   │ │ Act    │  │ Resp.  │  │ Reflect│ │ Cand.  │
   │ Agent  │ │ Agent  │  │ Agent  │  │ Agent  │ │ Store  │
   └───┬────┘ └───┬────┘  └───┬────┘  └───┬────┘ └────────┘
       │          │           │           │
       │ "next    │ "generate │ "user's   │ "preference
       │  act"    │  question │  reply"   │  delta + plan
       │          │  / pick"  │           │  update"
       ▼          ▼           ▼           ▼
                User-visible turn

The four agents#

Plan Agent. Reads the dialogue state and decides which conversational act to perform next. The act space is small and discrete — typically ask-preference, ask-constraint, narrow-candidates, justify-and-recommend, clarify-rejection, wrap-up. The Plan Agent is the policy of the system; it’s responsible for the “where in the conversation are we” judgement.
Act Agent. Given a chosen act, generates the surface content. If the act is ask-preference, the Act Agent produces a natural question that targets the highest-information-gain dimension given the current candidate set. If the act is justify-and-recommend, it composes the recommendation with reasons. This separation lets the Plan Agent reason at the policy level without committing to surface form.
Response Agent (user-facing). Wraps the Act Agent’s output for the user — formatting, persona, tone consistency across turns. In the simplest deployment this is a thin layer; in more elaborate ones it personalises the wording to the inferred user style.
Reflection Agent. After the user replies, reads the new utterance, extracts preference deltas (positive and negative), updates the candidate set’s scoring, and produces a structured summary for the Plan Agent’s next decision. The reflection step is the system’s “did that turn work” judgement.
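The “highest-information-gain dimension” the Act Agent targets can be made concrete. The following is a minimal sketch, assuming candidates are dicts of categorical features and that every candidate on the slate is equally likely to be the right one. Under that assumption, the expected information gain of asking about a feature equals the entropy of that feature’s values over the slate. The names `entropy` and `best_question_dimension` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_question_dimension(candidates, dimensions):
    """Pick the feature whose answer is expected to shrink the slate the most."""
    return max(dimensions, key=lambda d: entropy([c[d] for c in candidates]))


# Illustrative slate: genre splits it more evenly than language does.
slate = [
    {"genre": "comedy", "language": "en"},
    {"genre": "drama", "language": "en"},
    {"genre": "action", "language": "en"},
    {"genre": "comedy", "language": "fr"},
]
print(best_question_dimension(slate, ["genre", "language"]))  # genre
```

Here genre has 1.5 bits of entropy against roughly 0.81 for language, so a question about genre is the better use of a turn.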

Shared state#

The candidate store holds the current working slate (typically 5 to 50 items, depending on catalogue size and turn). The dialogue state holds: extracted preferences as a feature vector, prior turn history, the Plan Agent’s last chosen act, the user’s acceptance/rejection markers, and a turn counter.
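The fields listed above map naturally onto two small data structures. This is a sketch only: the class names, field names, and the `record_turn` helper are assumptions made for illustration, and the paper may organise its state differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DialogueState:
    # Extracted preferences as a sparse feature vector: feature -> signed weight.
    preferences: dict[str, float] = field(default_factory=dict)
    # Prior turns as (speaker, utterance) pairs.
    history: list[tuple[str, str]] = field(default_factory=list)
    last_act: Optional[str] = None                      # Plan Agent's last chosen act
    accepted: list[str] = field(default_factory=list)   # item ids the user accepted
    rejected: list[str] = field(default_factory=list)   # item ids the user rejected
    turn: int = 0


@dataclass
class CandidateStore:
    scores: dict[str, float] = field(default_factory=dict)  # item id -> current score

    def top(self, n):
        """The n best-scoring item ids, best first."""
        return sorted(self.scores, key=self.scores.get, reverse=True)[:n]

    def __len__(self):
        return len(self.scores)


def record_turn(state, act, system_text, user_text):
    """Append one exchange to the history and advance the turn counter."""
    state.history.append(("system", system_text))
    state.history.append(("user", user_text))
    state.last_act = act
    state.turn += 1


state = DialogueState()
store = CandidateStore({"m1": 0.9, "m2": 0.4, "m3": 0.7})
record_turn(state, "ask-preference", "What mood are you in?", "Something upbeat.")
print(state.turn, state.last_act, store.top(2), len(store))
```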

This is a textbook supervisor / specialist multi-agent shape — the Plan Agent supervises, the Act and Reflection agents specialise, the Response Agent is the I/O layer. The contribution of MACRS isn’t novelty in the shape; it’s the act space and the reflection scheme tailored to recommendation.
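To see how the four roles hand off to each other, here is a toy turn loop. Every agent below is a deterministic stand-in for what would be an LLM call, the dict-based state and the reply convention ('yes', 'no', or a genre name) are assumptions for illustration, and only the act names come from the description above.

```python
from enum import Enum


class Act(Enum):
    ASK_PREFERENCE = "ask-preference"
    ASK_CONSTRAINT = "ask-constraint"
    NARROW_CANDIDATES = "narrow-candidates"
    JUSTIFY_AND_RECOMMEND = "justify-and-recommend"
    CLARIFY_REJECTION = "clarify-rejection"
    WRAP_UP = "wrap-up"


def plan_agent(state):
    """Stand-in policy; a real Plan Agent would be an LLM limited to the Act enum."""
    if state["accepted"]:
        return Act.WRAP_UP
    if not state["candidates"]:
        return Act.ASK_CONSTRAINT
    if state["rejected_last"]:
        return Act.CLARIFY_REJECTION
    if len(state["candidates"]) <= 3:
        return Act.JUSTIFY_AND_RECOMMEND
    return Act.ASK_PREFERENCE


def act_agent(act, state):
    """Stand-in surface generation for the chosen act."""
    if act is Act.JUSTIFY_AND_RECOMMEND:
        return f"How about {state['candidates'][0]['title']}?"
    return f"({act.value})"


def response_agent(draft):
    """Thin I/O layer; persona and tone handling would live here."""
    return draft.strip()


def reflection_agent(state, reply):
    """Stand-in extraction: the reply is 'yes', 'no', or a genre name."""
    recommended = state["trace"][-1] is Act.JUSTIFY_AND_RECOMMEND
    state["accepted"] = recommended and reply == "yes"
    state["rejected_last"] = recommended and reply == "no"
    if state["rejected_last"]:
        state["candidates"] = state["candidates"][1:]  # drop the rejected pick
    elif reply not in ("yes", "no"):
        state["candidates"] = [c for c in state["candidates"] if c["genre"] == reply]


def run_turn(state, get_reply):
    """Plan, act, respond, then reflect on the user's reply."""
    act = plan_agent(state)
    state["trace"].append(act)
    message = response_agent(act_agent(act, state))
    if act is not Act.WRAP_UP:
        reflection_agent(state, get_reply(message))
    return act


state = {
    "candidates": [{"title": f"Film {i}", "genre": g}
                   for i, g in enumerate(["comedy", "drama", "drama", "comedy", "drama"])],
    "accepted": False, "rejected_last": False, "trace": [],
}
replies = iter(["comedy", "yes"])  # scripted user for the demo
while not state["trace"] or state["trace"][-1] is not Act.WRAP_UP:
    run_turn(state, lambda message: next(replies))
print([a.value for a in state["trace"]])
```

The trace records one enum value per turn, which is what makes the planner’s choices easy to inspect afterwards.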

Key innovations#

Three pieces of MACRS go beyond a generic multi-agent template:

A discrete, domain-specific act space. Most multi-agent chat systems let the supervisor produce free-form instructions to the workers. MACRS constrains the supervisor’s output to a small enum of conversational acts. The constraint makes the Plan Agent’s reasoning tractable, its outputs auditable, and its choices learnable from offline trajectories. You can label good and bad conversations by which acts were chosen at which point.
Feedback-aware reflection. The Reflection Agent doesn’t just summarise the user’s reply; it produces a signed preference delta (positive features the user expressed liking, negative features they rejected) and a plan adjustment hint. The Plan Agent reads both. This is the difference between “the user said something” and “the user’s most recent message means we should narrow the candidate set on the genre axis and avoid the action-heavy region.”
Candidate-set as first-class state. The shared candidate store is queried and rewritten every turn by the Reflection Agent, not just at the end. The Plan Agent can ask “how many candidates left after the latest filter?” before deciding whether to recommend or to probe further. Most chat-recommenders only fetch candidates once, at the start, and try to keep the conversation alive against a frozen slate.
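A signed preference delta and its effect on the candidate set might look like the sketch below. The `PreferenceDelta` type, the additive scoring rule, and the `floor` cut-off are all assumptions chosen for illustration; the text above does not specify how the Reflection Agent rescored items.

```python
from dataclasses import dataclass, field


@dataclass
class PreferenceDelta:
    liked: set[str] = field(default_factory=set)      # features the user expressed liking
    disliked: set[str] = field(default_factory=set)   # features the user rejected
    plan_hint: str = ""                               # free-text hint for the Plan Agent


def apply_delta(scores, features, delta, weight=1.0, floor=0.5):
    """Rescore every item on the slate and drop those that fall below `floor`."""
    updated = {}
    for item, score in scores.items():
        tags = features[item]
        new_score = (score
                     + weight * len(tags & delta.liked)
                     - weight * len(tags & delta.disliked))
        if new_score >= floor:
            updated[item] = new_score
    return updated


scores = {"A": 1.0, "B": 1.0, "C": 1.0}
features = {"A": {"action", "loud"}, "B": {"comedy", "upbeat"}, "C": {"drama"}}
# "Not that, something more upbeat" after an action-heavy pick:
delta = PreferenceDelta(liked={"upbeat"}, disliked={"action"},
                        plan_hint="avoid the action-heavy region")
scores = apply_delta(scores, features, delta)
print(scores, len(scores))
```

After the update the Plan Agent can read `len(scores)` to decide whether the slate is small enough to recommend from.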

A subtler design choice: every agent is a prompted general-purpose LLM, not a separately trained model. The specialisation comes from prompts and the act space, not from per-agent fine-tuning. That makes the system reproducible and the agent roster easy to extend (add a new act, write its prompt, register it with the Plan Agent).
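Extending the roster this way amounts to a small registry. The sketch below is one plausible shape, not the paper’s code: `ACT_PROMPTS`, `register_act`, `planner_menu`, `render`, the prompt wording, and the `compare-options` act are all hypothetical.

```python
# Act name -> prompt template for the Act Agent. The wording is illustrative.
ACT_PROMPTS = {
    "ask-preference": "Ask one natural question about the user's taste in {dimension}.",
    "justify-and-recommend": "Recommend {item} and explain why it fits: {reasons}.",
}


def register_act(name, prompt_template):
    """Add a new act; refuses to silently overwrite an existing one."""
    if name in ACT_PROMPTS:
        raise ValueError(f"act already registered: {name}")
    ACT_PROMPTS[name] = prompt_template


def planner_menu():
    """The act names the Plan Agent's prompt would offer as its only choices."""
    return sorted(ACT_PROMPTS)


def render(name, **slots):
    """Fill the act's template; raises KeyError for an unregistered act."""
    return ACT_PROMPTS[name].format(**slots)


register_act("compare-options", "Contrast {first} and {second} on {dimension}.")
print(planner_menu())
print(render("compare-options", first="Item A", second="Item B", dimension="mood"))
```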

Evaluation#

MACRS was evaluated against three baselines on conversational-recommendation benchmarks: a single-prompt chat model, a retrieval-augmented chat model, and a previous-generation multi-agent recommender. The headline metrics in the paper:

Success rate at turn budget. Did the user accept a recommendation within the turn limit (typically 8 turns)? MACRS reportedly beat the single-model baselines by a substantial margin and edged out the prior multi-agent baseline.
Turns to acceptance. Among successful conversations, how many turns to first accepted recommendation? MACRS converged faster — the act-level planning chose probing-vs-recommending more accurately.
User satisfaction proxies. Standard recommender-conversational metrics: appropriateness of clarifying questions, relevance of recommendations, recovery from rejection. MACRS scored higher across all three.
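The first two metrics are simple to compute once each conversation is reduced to the turn at which a recommendation was accepted. The sketch below assumes that representation (an integer turn, or `None` if nothing was accepted). The sample numbers are invented to show the arithmetic and are not results from the paper.

```python
from statistics import mean


def success_rate_at_budget(accept_turns, budget=8):
    """Fraction of conversations with an accepted recommendation within `budget` turns."""
    hits = [t for t in accept_turns if t is not None and t <= budget]
    return len(hits) / len(accept_turns)


def mean_turns_to_acceptance(accept_turns, budget=8):
    """Average turn of first acceptance, over successful conversations only."""
    hits = [t for t in accept_turns if t is not None and t <= budget]
    return mean(hits) if hits else None


# Made-up data: five conversations, one never accepted, one accepted too late.
accept_turns = [3, 5, None, 9, 8]
print(success_rate_at_budget(accept_turns))               # 0.6
print(round(mean_turns_to_acceptance(accept_turns), 2))   # 5.33
```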

The evaluation setup uses simulated users plus a smaller human study. The simulated user is itself an LLM that follows a hidden preference profile across multiple turns. This is the standard pattern in conversational-recommendation research, and it is worth noting separately: the eval harness is non-trivial to build and is a large part of the paper’s contribution.
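The core idea of a simulated user with a hidden preference profile fits in a few lines. The class below is a rule-based stand-in for the LLM simulator, with hypothetical names throughout. It shows only the contract: the recommender can ask and propose, but never reads the profile directly.

```python
class SimulatedUser:
    """Rule-based stand-in for an LLM simulator with a hidden preference profile."""

    def __init__(self, hidden_profile):
        self._profile = dict(hidden_profile)  # never exposed to the recommender

    def answer(self, dimension):
        """Reply to a clarifying question about one feature dimension."""
        return self._profile.get(dimension, "no preference")

    def accepts(self, item):
        """Accept only items that match every feature in the hidden profile."""
        return all(item.get(k) == v for k, v in self._profile.items())


user = SimulatedUser({"genre": "comedy", "language": "fr"})
print(user.answer("genre"))
print(user.answer("year"))
print(user.accepts({"genre": "comedy", "language": "fr", "year": 2019}))
print(user.accepts({"genre": "comedy", "language": "en"}))
```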

Trade-offs and limitations#

Where MACRS shines. Catalogues with structured features (genre, year, length, language, mood). Domains where the user benefits from a few clarifying questions before a recommendation. Conversations where the user is willing to invest 5–8 turns. Domains with a clear “accepted” signal.

Where it struggles. Cold-start with no candidate set scaffolding. Users who want one-shot recommendations and resent probing. Domains with unstructured items (free-text product descriptions, where preference dimensions are fuzzy). Conversations where the user changes goals mid-stream — the act space assumes a single recommendation target.

Other limitations:

Latency stacks. Four agent calls per user turn (Plan, Act, Response, then Reflection on the user’s reply before the next turn) add up. The team mitigates this with smaller, faster models for the Response and Reflection roles, but a sub-second total turn cost is hard to reach.
Discrete act space is also a ceiling. When the user’s behaviour falls outside the enum (asking a meta-question, going off-topic, mixing two recommendation goals), the Plan Agent has nowhere to route. Extending the act space is straightforward but adds maintenance.
Reflection-agent hallucination. If the Reflection Agent fabricates a preference the user didn’t express, the candidate set drifts. The system relies on the Reflection Agent’s groundedness; the paper notes this as an open issue.
Catalogue scale. The Reflection Agent rewrites the candidate scoring after every turn. For a million-item catalogue this is impractical unless the working slate is bounded and the rescoring is a re-ranking of the slate, not a full-catalogue re-score.
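The catalogue-scale point is easiest to see in code: score the full catalogue once to build a bounded slate, then confine every per-turn update to that slate. This is a sketch under assumed names (`build_slate`, `rerank`) with a toy catalogue; in practice the initial scores would come from a retrieval system.

```python
import heapq


def build_slate(catalogue_scores, k):
    """One pass over the full catalogue, keeping only the k best-scoring items."""
    return dict(heapq.nlargest(k, catalogue_scores.items(), key=lambda kv: kv[1]))


def rerank(slate, adjustment):
    """Per-turn rescoring touches only the slate, so its cost does not grow with the catalogue."""
    rescored = {item: score + adjustment.get(item, 0.0) for item, score in slate.items()}
    return sorted(rescored.items(), key=lambda kv: kv[1], reverse=True)


# Toy catalogue of 1,000 items with increasing scores.
catalogue = {f"item{i}": i / 1000 for i in range(1000)}
slate = build_slate(catalogue, k=3)
print(sorted(slate))

# A later turn boosts one slate item; only three scores are recomputed.
ranked = rerank(slate, {"item997": 0.5})
print(ranked[0][0])
```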

Lessons#

Generalisable takeaways from MACRS:

Constrain the supervisor. A small discrete act space beats free-form supervisor instructions for any domain where you can enumerate the recurring moves. The act space is documentation, debugger, and learnable policy all at once.
Make state first-class. The candidate set is shared, observable, and edited by name. Agents that “remember” everything in their own context are harder to reason about than agents that read and write a shared structured state.
Reflect on the user’s reply, not just on your own output. Most agent-reflection schemes critique what the agent did. The user’s response is more information-rich; pipe it through a dedicated reflection step.
Specialise via prompts, not via models. General-purpose LLMs with role-specific prompts are usually enough for an early multi-agent system. Save fine-tuning for the layer where the prompts have hit their ceiling — usually the Reflection Agent’s extraction quality.
Turn budgets are a UX feature, not a constraint. Users have a patience curve. The Plan Agent’s job is to spend that budget — probe vs recommend — in the order that maximises acceptance. Optimising for “best next reply” without a budget is the wrong objective.
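As a rough illustration of spending a turn budget, the rule below recommends once the slate is small or the budget is nearly gone. It is a hand-written heuristic with invented thresholds, not the paper’s policy; in MACRS this judgement belongs to the LLM-based Plan Agent.

```python
def should_recommend(slate_size, turns_used, budget=8, small_enough=3):
    """Recommend when the slate is small or when at most one turn remains."""
    turns_left = budget - turns_used
    return slate_size <= small_enough or turns_left <= 1


print(should_recommend(slate_size=40, turns_used=2))  # False: still worth probing
print(should_recommend(slate_size=3, turns_used=2))   # True: slate is small enough
print(should_recommend(slate_size=40, turns_used=7))  # True: budget nearly spent
```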

Why not just one model with a longer prompt?

Two reasons. First, the act-space discretisation gives you observability — you can label trajectories by which acts ran when, find the failure modes (“the Plan Agent kept choosing ask-preference after the candidate set was already small enough to recommend”), and improve them directly. A single model with a long prompt is a black box; you can’t tell from outside whether the failure was a planning mistake or a generation mistake. Second, separate agents let you mix models — a slow capable model for planning and reflection, faster cheaper models for surface generation. A monolith forces you to pay the capable-model cost on every turn.
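The observability argument can be shown with a small audit over logged trajectories. The sketch assumes each turn is logged as an (act, slate size) pair and uses an invented threshold for “small enough”; the function name `over_probing_turns` is hypothetical.

```python
def over_probing_turns(trace, small_enough=3):
    """Indices of turns where the planner probed although the slate was already small."""
    probing = {"ask-preference", "ask-constraint"}
    return [i for i, (act, slate_size) in enumerate(trace)
            if act in probing and slate_size <= small_enough]


# Illustrative log: the planner keeps asking after the slate has shrunk to 3 and 2.
trace = [
    ("ask-preference", 40),
    ("ask-constraint", 12),
    ("ask-preference", 3),
    ("ask-preference", 2),
    ("justify-and-recommend", 2),
]
print(over_probing_turns(trace))  # [2, 3]
```

A check like this points at a planning mistake specifically, which is the distinction a single long-prompt model does not expose.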