Case Studies

Deep dives into named research and industry agent systems: MACRS, NVIDIA Eureka, ChainBuddy, WebVoyager, MuLan, and OpenClaw. Each is a concrete design you can learn from.

6 items · 2 Intermediate · 4 Advanced

Patterns are useful, but named systems show you which patterns survive contact with reality. Each case study in this topic covers a public agent system with a paper or product page. We walk through the problem the team was solving, the architecture they landed on, the key innovations that made it work, and the trade-offs they accepted.

Every writeup follows the same structure: Context · Problem · Architecture · Key innovations · Evaluation · Trade-offs and limitations · Lessons · Related systems. Reading across them, you see the same patterns recur in different combinations, and that cross-system view is worth more than any single system.

Key concepts

  • Real agent systems are pattern stacks, not pattern singletons: Eureka combines LLM reward-code generation, evolutionary search, and reward reflection
  • Domain grounding matters more than raw model capability: WebVoyager works because of screenshot grounding, not because of a smarter LLM
  • Evaluation is the engineering: every system here has a non-trivial eval harness that justifies its design choices
  • Failure modes are specific to each design: read the limitations section, not just the wins
  • Most novel systems extend an existing pattern rather than inventing a new one

Reference template

// Reading a system writeup
## Context     — why this exists, what came before
## Problem     — what the team was actually solving
## Architecture — the shape, the components, the data flow
## Key innovations — what's actually new or non-obvious
## Evaluation  — how they measured it; the benchmarks they cared about
## Trade-offs  — what they gave up to ship
## Lessons    — what this design teaches you for your own systems
## Related systems

Adapt the sections to your problem; the structure is the load-bearing part.
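One way to keep writeups consistent with this template is a small lint check that flags missing sections. The Python sketch below is a hypothetical helper, not part of any existing tool: it assumes writeups are Markdown files whose `## ` headings are named after the template sections, and the names `TEMPLATE_SECTIONS` and `missing_sections` are illustrative.

```python
import re

# Section names from the reference template, assumed to appear as
# "## <name>" headings in a Markdown writeup (hypothetical convention).
TEMPLATE_SECTIONS = [
    "Context",
    "Problem",
    "Architecture",
    "Key innovations",
    "Evaluation",
    "Trade-offs",
    "Lessons",
    "Related systems",
]


def missing_sections(writeup: str) -> list[str]:
    """Return template sections with no matching '## ' heading, in template order.

    A heading matches if it starts with the section name (case-insensitive),
    so '## Trade-offs and limitations' satisfies 'Trade-offs'.
    Deeper headings such as '### Context' are ignored.
    """
    headings = [
        m.group(1).strip().lower()
        for m in re.finditer(r"^##\s+(.+)$", writeup, flags=re.MULTILINE)
    ]
    return [
        name
        for name in TEMPLATE_SECTIONS
        if not any(h.startswith(name.lower()) for h in headings)
    ]


if __name__ == "__main__":
    draft = "\n".join([
        "## Context",
        "Why the system exists.",
        "## Problem",
        "## Architecture",
        "## Evaluation",
        "## Trade-offs and limitations",
    ])
    print(missing_sections(draft))  # ['Key innovations', 'Lessons', 'Related systems']
```

Prefix matching keeps the check tolerant of longer headings such as "Trade-offs and limitations"; a team that wants stricter consistency could require exact heading names instead.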

Common pitfalls

  • Copying an architecture without copying the eval: half the value is in the measurement, not the design
  • Assuming the paper's results generalise: most papers report best-case results, while production exposes the worst case
  • Skipping the limitations: the lessons from failure are often the most useful part
  • Treating one system as the answer: agentic AI is moving fast, and every design here will look dated within a year
