Case Studies
Deep-dives into named research and industry agent systems — MACRS, NVIDIA Eureka, ChainBuddy, WebVoyager, MuLan, OpenClaw. Concrete designs you can learn from.
Patterns are useful; named systems are how you learn which patterns survive contact with reality. Each case study in this topic is a public agent system with a paper or product page. We walk through the problem the team was solving, the architecture they landed on, the key innovations that made it work, and the trade-offs they accepted.
The writeups are structured the same way: Context · Problem · Architecture · Key innovations · Evaluation · Trade-offs and limitations · Lessons · Related systems. Scanning across them, you see the same patterns recur in different combinations. That recurrence is worth more than any single system.
Key concepts
- Real agent systems are pattern-stacks, not pattern-singletons — Eureka combines zero-shot reward generation, evolutionary search, and reward reflection
- Domain grounding dominates capability — WebVoyager works because of screenshot grounding, not because of a smarter LLM
- Evaluation is the engineering — every system here has a non-trivial eval harness that justifies its design choices
- Failure modes are specific to each design — read the limitations section, not just the wins
- Most novel systems extend an existing pattern rather than inventing a new one
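One way to make the "pattern-stack" idea concrete is to tag each system with the patterns named in its summary and count which ones recur. The sketch below is illustrative only: the tag vocabulary and the `recurring_patterns` helper are assumptions, and the tags are a loose paraphrase of the one-line summaries in this topic's item list, not a classification taken from the papers. OpenClaw is left out because its summary names no specific pattern.

```python
from collections import Counter

# Hypothetical pattern tags, paraphrased from the item summaries in this topic.
# The vocabulary itself is an assumption made for illustration.
SYSTEM_PATTERNS = {
    "MACRS": {"multi-agent", "planning", "reflection"},
    "NVIDIA Eureka": {"zero-shot generation", "evolutionary search", "reflection"},
    "ChainBuddy": {"requirement gathering", "multi-agent"},
    "WebVoyager": {"ReAct", "screenshot grounding"},
    "MuLan": {"planning", "progressive generation", "feedback control"},
}


def recurring_patterns(systems, min_count=2):
    """Return {pattern: count} for patterns used by at least min_count systems."""
    counts = Counter(p for patterns in systems.values() for p in patterns)
    return {p: n for p, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    for pattern, n in sorted(recurring_patterns(SYSTEM_PATTERNS).items()):
        print(f"{pattern}: {n} systems")
```

With a finer or coarser tag vocabulary the counts change, so treat the output as a reading aid rather than a finding.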
Reference template
// Reading a system writeup
## Context — why this exists, what came before
## Problem — what the team was actually solving
## Architecture — the shape, the components, the data flow
## Key innovations — what's actually new or non-obvious
## Evaluation — how they measured it; the benchmarks they cared about
## Trade-offs — what they gave up to ship
## Lessons — what this design teaches you for your own systems
## Related systems
Adapt to your problem; the structure is the load-bearing part.
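If you keep your own writeups in this shape, a small check can flag sections that are still missing. This is a minimal sketch, assuming writeups are stored as text with `## ` headings as in the template; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names introduced here.

```python
import re

# Section names taken from the reference template; matching is by heading prefix,
# so "## Trade-offs and limitations" also satisfies "Trade-offs".
REQUIRED_SECTIONS = [
    "Context", "Problem", "Architecture", "Key innovations",
    "Evaluation", "Trade-offs", "Lessons", "Related systems",
]


def missing_sections(writeup: str) -> list[str]:
    """Return the template sections that have no matching '## ' heading."""
    headings = [
        m.group(1).strip().lower()
        for m in re.finditer(r"^##[ \t]+(.+)$", writeup, re.MULTILINE)
    ]
    return [
        section for section in REQUIRED_SECTIONS
        if not any(h.startswith(section.lower()) for h in headings)
    ]


if __name__ == "__main__":
    draft = "## Context\nWhy this exists.\n## Problem\n## Architecture\n## Evaluation\n"
    print(missing_sections(draft))
```

Prefix matching is a deliberate choice so that headings carrying a short gloss after the section name still count.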
Common pitfalls
- Copying an architecture without copying the eval — half the value is in the measurement, not the design
- Assuming the paper's results generalise — most papers report best-case; production sees worst-case
- Skipping the limitations — the lessons-from-failure are often the most useful part
- Treating one system as the answer — agentic AI is moving fast; every design here will look dated within a year
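The first two pitfalls suggest a habit worth sketching: when you re-run someone's design on your own tasks, report the worst task group alongside the best. The example below is a minimal sketch under stated assumptions. `summarize` is a hypothetical helper, and the task groups and outcomes are made-up placeholders, not results from any system in this topic.

```python
from statistics import mean


def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Summarize per-group success rates as best, mean, and worst."""
    rates = {
        group: sum(outcomes) / len(outcomes)
        for group, outcomes in results.items()
        if outcomes  # skip groups with no recorded runs
    }
    if not rates:
        raise ValueError("no outcomes to summarize")
    return {
        "best": max(rates.values()),
        "mean": mean(rates.values()),
        "worst": min(rates.values()),
    }


if __name__ == "__main__":
    # Placeholder outcomes (True = task completed), purely illustrative.
    toy_results = {
        "easy": [True, True, True, True],
        "medium": [True, True, False, False],
        "hard": [True, False, False, False],
    }
    print(summarize(toy_results))
```

A headline number close to "best" with a much lower "worst" is the gap between a paper's reported case and what production is likely to see.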
Items (6)
- MACRS — Multi-Agent Conversational Recommender
A multi-agent system for goal-directed conversational recommendations. Multi-agent act planning + user-feedback-aware reflection.
System · Advanced
- NVIDIA Eureka — LLM-Driven Reward Design
Coding LLMs that autonomously design and refine RL reward functions. Zero-shot generation, evolutionary search, reward reflection.
System · Advanced
- ChainBuddy — LLM Pipeline Generator
An agent that solves the 'blank page problem' for LLM-evaluation workflows. Requirement-gathering chat + multi-agent pipeline generation.
System · Intermediate
- WebVoyager — Multimodal Web Agent
A vision-and-text agent that navigates real websites. Multimodal ReAct loop, screenshot grounding, end-to-end task completion.
System · Advanced
- MuLan — Multimodal Diffusion Agent
An LLM-orchestrated approach to multi-object text-to-image generation. Planning, progressive generation, VLM-feedback control.
System · Advanced
- OpenClaw — Personal AI Assistant
A personal-assistant design: how to compose an agent that mixes calendar, mail, search, and reminders behind a single conversational surface.
System · Intermediate