ChainBuddy — LLM Pipeline Generator — Agentic

Context

Anyone who has tried to build an LLM-evaluation pipeline from scratch knows the feeling: you have a vague sense that “this prompt is producing bad outputs,” you want to measure it, and you’re staring at an empty Python file wondering where to start. Do you need a golden dataset? A judge model? A pairwise comparison? Pass-at-k? Metrics over what — relevance, correctness, faithfulness, style? Should it be online or batched? Where do the test cases come from?

This is the “blank page problem.” It isn’t unique to LLM eval — it’s the same problem a junior data scientist faces when asked to “investigate user churn” — but it’s especially acute for LLM evaluation because the tooling is young, the metrics are not standardised, and the workflow has a dozen judgement calls before you write a single line of code.

ChainBuddy is an agent designed specifically to walk users through this problem. It runs a guided conversation, extracts the user’s requirements, and outputs a runnable LLM-evaluation pipeline — a sequence of nodes (data load, prompt generation, model call, judge, aggregation) wired together as a directed graph. The interesting bit isn’t that the output is a pipeline; it’s that the system frames the requirement-gathering phase as a first-class agent loop, separate from the generation phase.

Problem

The concrete problems ChainBuddy targets:

Users don’t know what they don’t know. They want to “evaluate my prompt” but haven’t decided whether they need a binary classifier, a regression scorer, a rubric judge, or a comparative arena. Asking them upfront is a non-starter; they’d ask back, “which should I use?”
The configuration space is large. Each pipeline node has its own choices: which data source, which prompt template, which judge model, which scoring scheme, which aggregation. Multiplied across 5–8 nodes, the number of combinations defeats one-shot generation.
The output has to actually run. Unlike a chat answer or a recommendation, a generated pipeline either executes or doesn’t. There’s no soft-pass condition. The agent has to produce a graph that wires together correctly, with all node interfaces compatible.
User intent shifts during the conversation. A user starts by saying “I want to score relevance” and by the third turn realises they actually want faithfulness. The agent has to update its working understanding without restarting the conversation.
The output must be editable. Users don’t trust the agent enough to run a generated pipeline blind. The graph has to be presented as a structure they can inspect and edit, not as opaque code.

Architecture

ChainBuddy splits into two sequential agent phases: requirement gathering, then pipeline synthesis. Each phase has its own internal structure.

┌────────────────────────────────────────────────────────────────┐
│                     Phase 1: Requirements                       │
│                                                                 │
│    User ◄──► Requirement Agent ──► Requirement Spec (JSON)      │
│                     │                                           │
│                     │ "I need a clarifying question"            │
│                     ▼                                           │
│              Clarification Agent                                │
└────────────────────────┬───────────────────────────────────────┘
                         │
                         ▼  (spec is "done" — confirmed with user)
┌────────────────────────────────────────────────────────────────┐
│                Phase 2: Pipeline Synthesis                      │
│                                                                 │
│   Spec ──► Planner Agent ──► Node-list (ordered DAG sketch)     │
│                 │                                               │
│                 ▼                                               │
│           Node Generators (parallel)                            │
│        ┌──────────┬──────────┬──────────┐                       │
│        │ Loader   │ Prompter │ Judge    │ ... per node          │
│        └──────────┴──────────┴──────────┘                       │
│                 │                                               │
│                 ▼                                               │
│           Wiring Agent ──► Validated DAG ──► Pipeline file      │
└────────────────────────────────────────────────────────────────┘

Phase 1: requirement gathering

The Requirement Agent maintains a structured spec (JSON) that grows as the conversation progresses. The schema is fixed and small — task type, data source, prompt under evaluation, scoring scheme, judge configuration, output format. The agent’s job is to fill every required field through conversation.
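As a rough illustration, the spec could be modelled as a small dataclass whose fields mirror the list above. The class name `RequirementSpec`, the field names, and the helper `missing_fields` are assumptions made for this sketch, not ChainBuddy's actual schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RequirementSpec:
    """Hypothetical spec; one slot per required field named in the text."""
    task_type: Optional[str] = None
    data_source: Optional[str] = None
    prompt_under_eval: Optional[str] = None
    scoring_scheme: Optional[str] = None
    judge_config: Optional[dict] = None
    output_format: Optional[str] = None


def missing_fields(spec: RequirementSpec) -> list[str]:
    """Return the names of fields the conversation has not filled yet."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]


spec = RequirementSpec(task_type="rubric_judge", data_source="cases.csv")
print(missing_fields(spec))
```

A fixed set of slots like this gives the agent an explicit list of gaps to work through, which is what makes "fill every required field" a checkable goal.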

Each turn, the agent does one of three things: it asks a clarifying question (delegated to the Clarification Agent, which phrases the question to fit the gap in the spec), restates its current understanding for the user to confirm, or proposes a value with its reasoning. The user can override any field at any time.

The agent stops when every required field is filled and the user has explicitly confirmed the spec — typically 4–8 turns for a non-trivial pipeline.
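The per-turn choice and the stop condition can be sketched as a small decision function. This is a simplified, rule-based stand-in for what the text describes as an agent decision; `REQUIRED`, `SUGGESTED`, and `next_action` are hypothetical names, and the rule "propose when a default exists, otherwise ask" is an assumption.

```python
from typing import Optional

# Hypothetical required fields and suggested defaults (not ChainBuddy's real schema).
REQUIRED = ("task_type", "data_source", "prompt_under_eval",
            "scoring_scheme", "judge_config", "output_format")
SUGGESTED = {"output_format": "yaml"}


def next_action(spec: dict, confirmed: bool) -> tuple[str, Optional[str]]:
    """Pick the move for one turn: ask, propose, restate, or done."""
    gaps = [name for name in REQUIRED if spec.get(name) is None]
    if gaps:
        field_name = gaps[0]
        # Propose a value when a sensible default exists; otherwise ask.
        action = "propose" if field_name in SUGGESTED else "ask"
        return (action, field_name)
    if not confirmed:
        # Every field is filled: restate the spec and wait for sign-off.
        return ("restate", None)
    return ("done", None)


print(next_action({"task_type": "rubric_judge"}, confirmed=False))
```

Note that "done" is only reachable when the spec is both complete and explicitly confirmed, matching the two-part stop condition above.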

Phase 2: pipeline synthesis

The Planner Agent reads the confirmed spec and emits a high-level node sequence — “load CSV → render prompt → call model → judge with rubric → aggregate to a score.” This is a sketch, not the runnable graph yet; it’s the skeleton that the next layer fills in.
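To show the shape of the Planner's input and output, here is a rule-based stand-in. The real Planner is described as an agent, so the rules, the node names, and the function `plan_nodes` below are illustrative assumptions only.

```python
def plan_nodes(spec: dict) -> list[str]:
    """Stand-in planner: confirmed spec in, ordered node names out."""
    # Choose a loader from the data source's file extension (assumed rule).
    loader = "load_csv" if str(spec["data_source"]).endswith(".csv") else "load_jsonl"
    # Choose a judge from the scoring scheme (assumed rule).
    judge = "judge_pairwise" if spec["scoring_scheme"] == "pairwise" else "judge_rubric"
    return [loader, "render_prompt", "call_model", judge, "aggregate"]


plan = plan_nodes({"data_source": "cases.csv", "scoring_scheme": "rubric"})
print(" -> ".join(plan))
```

The output is only a list of names. Nothing here is configured or runnable yet, which is the sense in which the plan is a skeleton.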

The Node Generators are role-specialised — one per node type (data loader, prompt template node, model-call node, judge node, aggregator node, output node). Each generator takes the relevant slice of the spec, fills in its node’s configuration, and produces a typed node definition.
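One way to picture role-specialised generators is a registry keyed by node type, where each generator reads only its own slice of the spec and returns a typed node definition. `NodeDef`, the generator functions, and the type strings are hypothetical. The sketch also runs the generators sequentially and covers only two node types, whereas the architecture above runs them in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class NodeDef:
    name: str
    inputs: dict[str, str]    # field name -> type name this node consumes
    outputs: dict[str, str]   # field name -> type name this node produces
    config: dict = field(default_factory=dict)


def gen_loader(spec: dict) -> NodeDef:
    # The loader only needs the data source.
    return NodeDef("load_data", {}, {"rows": "list[dict]"},
                   {"path": spec["data_source"]})


def gen_prompter(spec: dict) -> NodeDef:
    # The prompt node only needs the prompt under evaluation.
    return NodeDef("render_prompt", {"rows": "list[dict]"},
                   {"prompts": "list[str]"},
                   {"template": spec["prompt_under_eval"]})


GENERATORS: dict[str, Callable[[dict], NodeDef]] = {
    "load_data": gen_loader,
    "render_prompt": gen_prompter,
}


def generate_nodes(plan: list[str], spec: dict) -> list[NodeDef]:
    """Run the matching generator for each planned node."""
    return [GENERATORS[name](spec) for name in plan]


spec = {"data_source": "cases.csv", "prompt_under_eval": "Summarise: {text}"}
nodes = generate_nodes(["load_data", "render_prompt"], spec)
print([n.name for n in nodes])
```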

The Wiring Agent checks that each node’s outputs match the next node’s inputs (types, shapes, fields), repairs mismatches when possible, and produces the final DAG file (typically YAML or a Python module).
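The checking half of that job can be sketched as a pass over adjacent nodes. The sketch assumes a linear chain and represents each node as a plain dict with `inputs` and `outputs` maps from field name to type name; both are simplifications, and the repair step is left out.

```python
def check_wiring(nodes: list[dict]) -> list[str]:
    """Compare each node's declared outputs with the next node's inputs."""
    problems = []
    for upstream, downstream in zip(nodes, nodes[1:]):
        for field_name, wanted in downstream["inputs"].items():
            produced = upstream["outputs"].get(field_name)
            if produced is None:
                problems.append(
                    f"{downstream['name']}: missing input '{field_name}'")
            elif produced != wanted:
                problems.append(
                    f"{downstream['name']}: '{field_name}' is {produced}, "
                    f"expected {wanted}")
    return problems


chain = [
    {"name": "load", "inputs": {}, "outputs": {"rows": "list[dict]"}},
    {"name": "prompt", "inputs": {"rows": "list[dict]"},
     "outputs": {"prompts": "list[str]"}},
    {"name": "model", "inputs": {"prompt": "list[str]"},  # typo: singular
     "outputs": {"completions": "list[str]"}},
]
print(check_wiring(chain))
```

Here the model node asks for `prompt` while the prompt node emits `prompts`, the kind of near-miss interface mismatch a dedicated checker is meant to catch before execution.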

The output is then surfaced to the user as a visual graph or a structured file they can edit before running.

Key innovations#

What makes ChainBuddy more than “an LLM that emits Python”:

Separate phases for requirement-gathering and synthesis. Most pipeline-generation agents try to extract requirements and emit code in a single conversation — the result is partial pipelines and forgotten edge cases. ChainBuddy commits to “we are not generating any code until the spec is signed off.” The phase boundary is hard.
A structured spec, not a transcript. The Requirement Agent doesn’t carry the conversation forward as a long context; it carries a JSON spec. Each turn updates the spec. This makes the agent’s progress observable (you can read the current spec mid-conversation) and bounds the context size regardless of how long the conversation runs.
Per-node generators with typed interfaces. Splitting the synthesis across role-specialised generators is a multi-agent take on the classical “compiler-front-end / compiler-back-end” split. Each generator only needs to know its own node type. The combinatorics that defeated one-shot generation become tractable.
A separate Wiring Agent. Type-checking the graph is the most error-prone step — naive single-LLM pipeline generators frequently emit nodes whose interfaces don’t match. By making wiring its own agent with its own retry/repair loop, the system isolates and bounds that failure mode.
Editable output, not “trust me.” The pipeline is surfaced as a structured graph the user can inspect, not as a magic black box. This re-introduces human judgement at the right step — after the heavy lifting, before execution.
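The "structured spec, not a transcript" point can be made concrete with a merge function: each turn contributes a small update, and the state carried forward is always the same fixed set of slots. `apply_update` and the field names are hypothetical.

```python
def apply_update(spec: dict, update: dict) -> dict:
    """Return a new spec with this turn's field changes merged in."""
    unknown = set(update) - set(spec)
    if unknown:
        # The schema is fixed, so reject fields it does not define.
        raise KeyError(f"fields not in schema: {sorted(unknown)}")
    return {**spec, **update}


spec = {"task_type": None, "scoring_scheme": None, "data_source": None}
turns = [
    {"task_type": "rubric_judge"},
    {"scoring_scheme": "relevance"},
    {"scoring_scheme": "faithfulness"},  # the user changes their mind
]
for update in turns:
    spec = apply_update(spec, update)

print(spec)
```

After three turns the spec still has three keys. The shift from relevance to faithfulness simply overwrites one slot, so no earlier turn needs to be replayed.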

Evaluation#

ChainBuddy’s evaluation pairs three signals:

Spec-completeness. Did the agent extract a usable spec from the user? Measured against a held-out set of user-intent profiles, with success defined as “the spec compiles to a runnable pipeline.” Reported success rates are around 80–90% on in-distribution profiles and drop on novel intents.
Pipeline correctness. Does the generated pipeline run end-to-end on representative data and produce a plausible score? Measured by execution success rate on a held-out task set.
User-effort proxy. How many turns to a confirmed spec? How many edits to the generated pipeline before the user ran it? The agent reduces both materially compared to a single-LLM baseline.
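A minimal sketch of how the three signals could be aggregated from per-session records. The record fields and the function `summarise` are assumptions, and the four records are made-up toy values used only to exercise the code, not ChainBuddy results.

```python
def summarise(runs: list[dict]) -> dict:
    """Aggregate per-session records into the three evaluation signals."""
    n = len(runs)
    return {
        # Share of sessions whose spec compiled to a runnable pipeline.
        "spec_success_rate": sum(int(r["spec_runnable"]) for r in runs) / n,
        # Share of sessions whose pipeline executed end-to-end.
        "execution_success_rate": sum(int(r["executed"]) for r in runs) / n,
        # User-effort proxies.
        "mean_turns": sum(r["turns"] for r in runs) / n,
        "mean_edits": sum(r["edits"] for r in runs) / n,
    }


# Toy records, invented purely for illustration.
toy_runs = [
    {"spec_runnable": True, "executed": True, "turns": 5, "edits": 1},
    {"spec_runnable": True, "executed": True, "turns": 6, "edits": 0},
    {"spec_runnable": True, "executed": False, "turns": 4, "edits": 2},
    {"spec_runnable": False, "executed": False, "turns": 5, "edits": 1},
]
print(summarise(toy_runs))
```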

The evaluation harness uses both simulated user dialogues (an LLM persona-driven simulator) and a smaller human study. The pattern is the same as MACRS: simulator for iteration speed, humans for ground-truth calibration.

Trade-offs and limitations#

Where ChainBuddy works well. Domains where the output structure is a fixed schema (pipelines, forms, configuration files). Users who want guided help rather than one-shot generation. Workflows where the cost of a wrong output is high (running an expensive eval) and the cost of one extra clarifying turn is low.

Where it doesn’t help. Open-ended outputs without a schema (essays, designs). Users who already know exactly what they want and are slowed down by clarifying questions. Domains with so many possible node types that a per-node generator roster becomes unmanageable.

Other limitations:

Spec schema is the ceiling. The agent can only produce pipelines whose shape fits the predefined node types. Adding a new pipeline shape requires a new node generator and updates to the Planner. This is fine in a stable domain and painful in an evolving one.
The user has to know what “good” looks like. ChainBuddy surfaces the pipeline for editing, but if the user can’t evaluate the generated graph, the editability is theoretical. Onboarding documentation has to teach pipeline-reading before pipeline-editing helps.
Clarification fatigue. A user who wants speed will resent the multi-turn requirements phase. The system needs a “skip and generate from defaults” escape hatch — and getting the defaults right is its own design problem.
Cross-node coherence requires the Wiring Agent. Per-node generators don’t see each other, so the Wiring Agent has to do the cross-cutting reasoning. When wiring fails in a way a simple repair can’t fix, the system retries, and in pathological cases the retries can loop without converging.
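One common way to keep a repair loop from running forever is to cap the number of repairs and stop early when the same set of problems comes back. The sketch below shows that idea; it is an assumption about how such a guard could look, not a description of ChainBuddy's retry policy, and `wire_with_budget` and `WiringError` are hypothetical names.

```python
class WiringError(RuntimeError):
    """Raised when the graph still has problems after the repair budget."""


def wire_with_budget(nodes, check, repair, max_repairs=3):
    """Alternate check and repair until clean, out of budget, or stuck."""
    seen = set()
    repairs = 0
    problems = check(nodes)
    while problems:
        signature = tuple(problems)
        # Stop if the budget is spent or the same problems have recurred.
        if repairs >= max_repairs or signature in seen:
            raise WiringError(f"unresolved after {repairs} repairs: {problems}")
        seen.add(signature)
        nodes = repair(nodes, problems)
        repairs += 1
        problems = check(nodes)
    return nodes


# Toy stand-ins: "nodes" are integers and negatives count as problems.
check = lambda ns: [f"negative:{n}" for n in ns if n < 0]
good_repair = lambda ns, problems: [abs(n) for n in ns]
stuck_repair = lambda ns, problems: ns  # changes nothing

print(wire_with_budget([1, -2], check, good_repair))
```

With `stuck_repair`, the second check reports the same problems as the first, so the function raises `WiringError` after a single repair instead of spinning.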

Lessons#

Generalisable takeaways from ChainBuddy:

Phase boundaries are alignment tools. Splitting “extract requirements” from “generate output” with a hard handoff lets each phase have its own loop, its own retry policy, and its own success criterion. Trying to do both in one conversation does neither well.
Structured intermediate state beats long context. When an agent has to remember user preferences across many turns, give it a structured slot to write them into. The slot is observable; the long context isn’t.
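Decompose generation by output structure. When the output is a graph of typed nodes, generating each node type with its own agent is cleaner than generating the whole graph in one call. A single call has to get the product of all per-node choices right at once; per-node generators each handle only their own node’s choices, so the decision space grows with the sum of the choices rather than their product.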
Wiring is a distinct skill. If your output has internal consistency constraints (type matches, schema compatibility), give those constraints their own checker. Don’t expect a generator that’s busy generating to also check itself.
Editable outputs are an alignment safety net. Surfacing the generated artefact as something the user can inspect and modify lets you ship before the agent is reliable. “Confidently wrong code, run silently” is the bad failure mode; “confidently wrong code, presented for human review” is the recoverable one.
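The lesson about decomposing generation by output structure comes down to simple arithmetic. The option counts below are invented for illustration; only the contrast between the product and the sum matters.

```python
import math

# Hypothetical number of options for each of six nodes.
choices_per_node = [3, 4, 5, 4, 3, 2]

# One-shot generation must pick a whole configuration at once.
joint_configs = math.prod(choices_per_node)

# Per-node generators each choose among their own options only.
per_node_decisions = sum(choices_per_node)

print(joint_configs, per_node_decisions)  # 1440 21
```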

Why not just one big LLM call with a long system prompt?

We’ve tried it. One LLM call asked to “have a conversation with the user to extract their requirements, then generate a pipeline” produces a pipeline at turn 1, before any requirements are gathered, because the model is trained to generate when prompted. Phase-splitting fights the model’s eagerness — the requirement-gathering agent doesn’t have the pipeline-generation tools available to it at all, so it can’t end the phase early by accident. The architectural constraint enforces the workflow.
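The "no generation tools in the requirements phase" constraint can be sketched as a per-phase allowlist that is checked before any tool runs. The phase names, tool names, and `call_tool` are hypothetical.

```python
# Hypothetical tool allowlists, one per phase.
PHASE_TOOLS = {
    "requirements": {"ask_user", "update_spec", "confirm_spec"},
    "synthesis": {"plan_nodes", "generate_node", "wire_graph"},
}


def call_tool(phase: str, tool: str, registry: dict, **kwargs):
    """Run a tool only if the current phase is allowed to use it."""
    if tool not in PHASE_TOOLS[phase]:
        raise PermissionError(f"'{tool}' is not available in phase '{phase}'")
    return registry[tool](**kwargs)


# Toy tool implementations for the demonstration.
registry = {
    "ask_user": lambda question: f"asked: {question}",
    "generate_node": lambda node_type: f"generated: {node_type}",
}

print(call_tool("requirements", "ask_user", registry, question="Which judge?"))
```

Calling `generate_node` while the phase is `"requirements"` raises `PermissionError`, so the model cannot skip ahead no matter how it is prompted. The enforcement lives in the harness, not in the system prompt.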