Case Studies

Deep-dives into named research and industry agent systems — MACRS, NVIDIA Eureka, ChainBuddy, WebVoyager, MuLan, OpenClaw. Concrete designs you can learn from.

6 items · 2 Intermediate · 4 Advanced

Patterns are useful, but named systems are how you learn which patterns survive contact with reality. Each case study in this topic covers a public agent system with a paper or product page. We walk through the problem the team was solving, the architecture they landed on, the key innovations that made it work, and the trade-offs they accepted.

The writeups all share one structure: Context · Problem · Architecture · Key innovations · Evaluation · Trade-offs and limitations · Lessons · Related systems. Scanning across them, you see the same patterns recur in different combinations. That recurrence is the real takeaway, and it is worth more than any single system.

Key concepts

Real agent systems are pattern-stacks, not pattern-singletons — Eureka combines zero-shot reward generation, evolutionary search, and reward reflection
Domain grounding dominates capability — WebVoyager works because of screenshot grounding, not because of a smarter LLM
Evaluation is the engineering — every system here has a non-trivial eval harness that justifies its design choices
Failure modes are specific to each design — read the limitations section, not just the wins
Most novel systems extend an existing pattern rather than inventing a new one
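
One way to make the pattern-stack idea concrete is to tag each system with the patterns its summary names and count how often each tag recurs. The Python sketch below does that. The tag names, and the mapping from each one-line summary to tags, are an assumed reading of the item descriptions on this page rather than labels taken from the papers. OpenClaw is left out because its summary names no specific pattern.

```python
from collections import Counter

# Hypothetical pattern tags, assigned by reading the one-line summaries
# on this page. They are illustrative labels, not an official taxonomy.
SYSTEM_PATTERNS = {
    "MACRS": {"multi-agent", "planning", "reflection"},
    "NVIDIA Eureka": {"zero-shot generation", "evolutionary search", "reflection"},
    "ChainBuddy": {"requirement gathering", "multi-agent"},
    "WebVoyager": {"ReAct", "grounding"},
    "MuLan": {"planning", "progressive generation", "feedback control"},
}


def recurring_patterns(systems, min_count=2):
    """Return {pattern: count} for patterns used by at least min_count systems."""
    counts = Counter(tag for tags in systems.values() for tag in tags)
    return {tag: n for tag, n in counts.items() if n >= min_count}


def is_pattern_stack(tags):
    """A system counts as a 'stack' when it combines more than one pattern."""
    return len(tags) > 1


if __name__ == "__main__":
    for tag, n in sorted(recurring_patterns(SYSTEM_PATTERNS).items()):
        print(f"{tag}: {n} systems")
```

With these assumed tags, every system combines at least two patterns, and three tags (multi-agent, planning, reflection) each appear in two systems. A different tagging would give different counts, so treat the output as a reading aid, not a finding.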

Reference template

// Reading a system writeup
## Context     — why this exists, what came before
## Problem     — what the team was actually solving
## Architecture — the shape, the components, the data flow
## Key innovations — what's actually new or non-obvious
## Evaluation  — how they measured it; the benchmarks they cared about
## Trade-offs  — what they gave up to ship
## Lessons    — what this design teaches you for your own systems
## Related systems

Adapt to your problem; the structure is the load-bearing part.
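
If you keep your own writeups in this shape, a small check can confirm that no section has been dropped. The Python sketch below assumes a writeup is Markdown text with `## ` headings as in the template. `REQUIRED_SECTIONS`, `section_titles`, and `missing_sections` are hypothetical names introduced here for illustration.

```python
# Section titles taken from the template above.
REQUIRED_SECTIONS = [
    "Context", "Problem", "Architecture", "Key innovations",
    "Evaluation", "Trade-offs", "Lessons", "Related systems",
]


def section_titles(markdown_text):
    """Extract level-2 heading titles, dropping any trailing description."""
    titles = []
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            heading = line[3:]
            # The template follows each title with a dash and a description;
            # keep only the part before the dash (U+2014 or a spaced hyphen).
            for sep in ("\u2014", " - "):
                heading = heading.split(sep)[0]
            titles.append(heading.strip())
    return titles


def missing_sections(markdown_text):
    """List required sections with no matching heading, in template order."""
    found = section_titles(markdown_text)
    # Prefix match so "Trade-offs and limitations" satisfies "Trade-offs".
    return [
        name for name in REQUIRED_SECTIONS
        if not any(title.startswith(name) for title in found)
    ]


if __name__ == "__main__":
    draft = "## Context\n...\n## Problem\n...\n## Architecture\n..."
    print(missing_sections(draft))
```

The check only looks for headings. It cannot tell whether the Evaluation or Trade-offs section actually says anything, which is the part that matters.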

Common pitfalls

Copying an architecture without copying the eval — half the value is in the measurement, not the design
Assuming the paper's results generalise — papers tend to report results on curated benchmarks under favourable settings; production traffic includes the inputs those benchmarks leave out
Skipping the limitations — the lessons-from-failure are often the most useful part
Treating one system as the answer — agentic AI is moving fast; every design here will look dated within a year
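
The first two pitfalls are easier to avoid with even a minimal harness that reports the worst slice alongside the headline number. The Python sketch below is illustrative only. The task categories and pass/fail outcomes are made-up placeholder data, and `summarize` is a hypothetical helper, not part of any system covered here.

```python
from collections import defaultdict


def summarize(results):
    """Summarize (category, succeeded) pairs.

    Returns the overall success rate, the per-category rates, and the
    name of the category with the lowest rate.
    """
    by_category = defaultdict(list)
    for category, succeeded in results:
        by_category[category].append(bool(succeeded))
    if not by_category:
        raise ValueError("no results to summarize")

    rates = {c: sum(v) / len(v) for c, v in by_category.items()}
    total = sum(len(v) for v in by_category.values())
    overall = sum(sum(v) for v in by_category.values()) / total
    worst = min(rates, key=rates.get)
    return {"overall": overall, "per_category": rates, "worst": worst}


# Placeholder outcomes, invented for this example only.
RESULTS = [
    ("search", True), ("search", True), ("search", True), ("search", True),
    ("checkout", True), ("checkout", False), ("checkout", False), ("checkout", False),
]

if __name__ == "__main__":
    report = summarize(RESULTS)
    print(f"overall: {report['overall']:.3f}")
    print(f"worst category: {report['worst']} "
          f"({report['per_category'][report['worst']]:.2f})")
```

In this placeholder data the overall rate is 0.625, while the worst category succeeds only a quarter of the time. A headline average can hide exactly the slice that production traffic will hit.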

Items (6)

MACRS — Multi-Agent Conversational Recommender
A multi-agent system for goal-directed conversational recommendations. Multi-agent act planning + user-feedback-aware reflection.
System · Advanced

NVIDIA Eureka — LLM-Driven Reward Design
Coding LLMs that autonomously design and refine RL reward functions. Zero-shot generation, evolutionary search, reward reflection.
System · Advanced

ChainBuddy — LLM Pipeline Generator
An agent that solves the 'blank page problem' for LLM-evaluation workflows. Requirement-gathering chat + multi-agent pipeline generation.
System · Intermediate

WebVoyager — Multimodal Web Agent
A vision-and-text agent that navigates real websites. Multimodal ReAct loop, screenshot grounding, end-to-end task completion.
System · Advanced

MuLan — Multimodal Diffusion Agent
An LLM-orchestrated approach to multi-object text-to-image generation. Planning, progressive generation, VLM-feedback control.
System · Advanced

OpenClaw — Personal AI Assistant
A personal-assistant design: how to compose an agent that mixes calendar, mail, search, and reminders behind a single conversational surface.
System · Intermediate
