Design a Multi-Agent Medical Diagnosis System

A safety-critical multi-agent design: triage, diagnosis, second-opinion, and uncertainty handling for a hospital setting.

Exercise Advanced

Scenario

You’re the lead engineer on a clinical-AI team inside a mid-sized hospital network — five hospitals, around 400 beds total, roughly 90,000 emergency-department visits per year. Leadership wants a multi-agent diagnosis system embedded into the existing EHR. The brief, in their own words:

“When a patient is admitted, the agent should help the on-call physician work through differential diagnosis faster. It should not replace the doctor. It should not make decisions on its own. But it should be a meaningfully better assistant than what we have today, which is the doctor typing symptoms into a search bar.”

What “the agent” actually has to do, decomposed:

  • Triage on arrival. Look at vitals, chief complaint, and any structured intake fields. Suggest an acuity level (ESI 1–5) and surface the most concerning differential diagnoses for nursing to consider.
  • Differential diagnosis support. As labs, imaging, and notes accumulate, propose a ranked differential list with confidence levels, the supporting evidence for each, and the missing evidence that would discriminate between them.
  • Second-opinion check. Before an attending signs off on a working diagnosis, run an independent second-opinion pass — different prompt, different reasoning chain, ideally a different base model — and flag disagreements.
  • Uncertainty surfacing. When the system is not confident (the evidence is thin, the case is rare, or the labs are contradictory), it should say so loudly. Silence about uncertainty is worse than being wrong.
  • Documentation assist. Draft the assessment-and-plan section of the note from the structured evidence. The physician edits and signs; the agent never signs.

Your job: design the system. Architecture, agent roles, tool surface, memory model, evaluation, failure handling. Whiteboard depth.

Constraints

What’s fixed:

  • HIPAA and equivalent jurisdictional rules. No PHI may leave the hospital’s data perimeter. That means either an on-prem model or a BAA-covered hosted endpoint with audit logging. No raw PHI to a general-purpose API.
  • The EHR is the source of truth. Not the agent. Every fact the agent uses must be traceable back to a row in the EHR — lab value, vital, note, imaging report. Citations are mandatory.
  • No autonomous actions. The agent does not order labs, does not page consultants, does not write orders, does not commit anything to the EHR without an explicit physician click. Read-only proposals only.
  • Latency. Triage suggestions in under 5 seconds from intake completion. Differential updates in under 10 seconds after each new data point. Second-opinion check in under 15 seconds before sign-off.
  • Cost. A ceiling of roughly $5 per encounter for inference. ED throughput is high; the unit economics fail above that.
  • Auditability. Every recommendation must be reproducible — same input, same output — and every prompt/response pair retained for at least seven years to match regulatory retention.
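
The latency and cost limits above can be written down as data so that every agent call is checked against them. The following is a minimal sketch, not a definitive implementation: the stage names (`triage`, `differential`, `second_opinion`) and the `EncounterBudget` helper are hypothetical, and only the numeric limits come from the constraints list.

```python
from dataclasses import dataclass, field

# Limits taken from the constraints list; the stage names are illustrative.
LATENCY_LIMIT_S = {"triage": 5.0, "differential": 10.0, "second_opinion": 15.0}
COST_LIMIT_USD = 5.00  # per encounter, inference only


@dataclass
class EncounterBudget:
    """Tracks inference spend and latency breaches for one encounter."""
    spent_usd: float = 0.0
    breaches: list = field(default_factory=list)

    def record(self, stage: str, latency_s: float, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        limit = LATENCY_LIMIT_S[stage]
        if latency_s > limit:
            self.breaches.append(f"{stage}: {latency_s:.1f}s > {limit:.0f}s")
        if self.spent_usd > COST_LIMIT_USD:
            self.breaches.append(f"cost: ${self.spent_usd:.2f} > ${COST_LIMIT_USD:.2f}")

    def within_budget(self) -> bool:
        return not self.breaches


budget = EncounterBudget()
budget.record("triage", latency_s=2.1, cost_usd=0.05)
budget.record("differential", latency_s=12.4, cost_usd=1.20)  # exceeds the 10 s limit
print(budget.within_budget(), budget.breaches)
```

For scale: at roughly 90,000 ED visits per year, a $5 ceiling works out to at most about $450,000 per year in inference spend, and only if every visit used its full budget.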

What’s variable:

  • Choice of base model(s). Choice of orchestration shape (sequential, supervisor-worker, debate). Memory layer. Whether to fine-tune, RAG, or pure prompting against the medical knowledge base.
  • The UI shape, within “embedded in the existing EHR’s side panel”.
  • Whether the second-opinion model is the same model with a different prompt or a genuinely different model.

What’s wishful (don’t assume):

  • That the EHR’s data is clean. It is not. Free-text notes contain abbreviations, typos, copy-pasted prior-encounter blocks, and contradictions.
  • That physicians will accept long explanations. They will skim. Output must be scannable in under 10 seconds.
  • That you can train on the hospital’s own data freely. You probably cannot, at least not without an IRB process and a data-use agreement.

Approach

A reasonable architecture, in broad strokes:

       ┌──────────────────────────────────────────────┐
       │              EHR side panel UI               │
       │ (triage card, differential, sign-off check)  │
       └──────────────────────┬───────────────────────┘
                     ┌────────▼────────┐
                     │  Orchestrator   │
                     │  (supervisor)   │
                     └────────┬────────┘
        ┌─────────────────────┼───────────────────────┐
        ▼                     ▼                       ▼
┌────────────────┐   ┌──────────────────┐   ┌───────────────────┐
│ Triage agent   │   │ Diagnosis agent  │   │ Second-opinion    │
│ (fast, small)  │   │ (capable, large) │   │ agent             │
└───────┬────────┘   └────────┬─────────┘   └─────────┬─────────┘
        │                     │                       │
        └─────────────────────┼───────────────────────┘
                   ┌──────────────────────┐
                   │     Tool surface     │
                   │     (read-only)      │
                   └──────────┬───────────┘
    ┌───────────────────────────────────────────────────┐
    │ EHR · Labs · Imaging · Knowledge base · Guidelines│
    └───────────────────────────────────────────────────┘
                   ┌──────────────────────┐
                   │  Audit + eval store  │
                   └──────────────────────┘

Three sub-agents because the responsibilities are genuinely different:

  • Triage agent. Runs once at intake. Inputs: vitals, chief complaint, intake form. Output: ESI level proposal + top concerning differentials. Cheap model — this fires for every ED arrival regardless of acuity.
  • Diagnosis agent. Runs continuously as new data arrives. Maintains a running differential, ranks it, surfaces missing evidence. Capable model — the reasoning quality matters most here.
  • Second-opinion agent. Runs on demand, when the physician signals “I think I have a working diagnosis.” Different prompt template, different base model where possible, no shared memory with the diagnosis agent. The point is independent review.

A supervisor orchestrator owns the conversation state, decides which sub-agent fires when, and assembles the final user-facing summary. The sub-agents never talk to each other directly — they go through the supervisor — which keeps the audit trail linear.
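
The routing rule described above (one supervisor, no agent-to-agent calls, one linear audit trail) can be sketched as follows. This is an illustration under stated assumptions: the event names, the `Supervisor` class, and the stub agents are all hypothetical, and a real sub-agent would wrap a model call rather than return a fixed dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical event names; real triggers would come from the EHR integration.
ROUTES = {
    "intake_complete": "triage",
    "new_data": "diagnosis",
    "signoff_requested": "second_opinion",
}


@dataclass
class Supervisor:
    """Routes events to sub-agents and keeps one linear audit trail."""
    agents: dict  # agent name -> callable(encounter_id, payload) -> dict
    audit_log: list = field(default_factory=list)

    def handle(self, event: str, encounter_id: str, payload: dict) -> dict:
        agent_name = ROUTES[event]  # unknown events raise KeyError on purpose
        proposal = self.agents[agent_name](encounter_id, payload)
        # Sub-agents never call each other, so one append per call keeps the trail linear.
        self.audit_log.append({
            "seq": len(self.audit_log), "event": event, "agent": agent_name,
            "encounter_id": encounter_id, "input": payload, "output": proposal,
        })
        return proposal


def stub_agent(name: str) -> Callable:
    # Stand-in for a model call; returns a fixed, clearly fake proposal.
    return lambda encounter_id, payload: {"agent": name, "proposal": "stub"}


supervisor = Supervisor(
    agents={n: stub_agent(n) for n in ("triage", "diagnosis", "second_opinion")}
)
supervisor.handle("intake_complete", "enc-001", {"chief_complaint": "chest pain"})
supervisor.handle("signoff_requested", "enc-001", {"working_diagnosis": "NSTEMI"})
print([(entry["seq"], entry["agent"]) for entry in supervisor.audit_log])
```

Because every call passes through `handle`, the audit log is a simple ordered list per process; a production version would persist it to the audit store rather than hold it in memory.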

The tool surface is read-only by hard rule:

  • get_patient_vitals(patient_id, time_window)
  • get_recent_labs(patient_id, time_window)
  • get_imaging_reports(patient_id, study_ids)
  • search_knowledge_base(query, specialty)
  • lookup_guideline(condition, version)
  • get_prior_encounters(patient_id, limit)
  • compute_score(score_name, inputs) — wrapped clinical calculators (Wells, HEART, etc.)

No order_lab, no page_consultant, no write_note. The diagnosis agent drafts a note; the doctor commits it.
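
One way to make "read-only by hard rule" structural rather than a prompt instruction is an allowlist enforced at registration time. The sketch below assumes a hypothetical `ToolRegistry` class and a fake lab lookup; only the tool names come from the list above.

```python
READ_ONLY_TOOLS = frozenset({
    "get_patient_vitals", "get_recent_labs", "get_imaging_reports",
    "search_knowledge_base", "lookup_guideline", "get_prior_encounters",
    "compute_score",
})


class ToolNotAllowed(Exception):
    """Raised when a tool outside the read-only allowlist is requested."""


class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, fn):
        # Refuse at registration time, so a write tool cannot be wired in by accident.
        if name not in READ_ONLY_TOOLS:
            raise ToolNotAllowed(f"{name} is not on the read-only allowlist")
        self._tools[name] = fn

    def call(self, name, **kwargs):
        if name not in self._tools:
            raise ToolNotAllowed(f"{name} is not available to agents")
        return self._tools[name](**kwargs)


registry = ToolRegistry()
# Fake lab lookup for illustration only; a real one would query the EHR.
registry.register("get_recent_labs", lambda patient_id, time_window: [])

try:
    registry.register("order_lab", lambda **kwargs: None)
    blocked = False
except ToolNotAllowed:
    blocked = True
print(blocked)
```

The design choice here is that the model never sees a write tool in its tool list at all, so the constraint does not depend on the model following instructions.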

Design decisions to make

The decisions a good design conversation surfaces:

  1. Single agent vs multi-agent. A single capable model with a complex system prompt can technically do triage, diagnosis, and second-opinion. But the second-opinion role requires independence — same model with a different prompt is not independent in any meaningful statistical sense. The argument for multi-agent here isn’t ergonomic; it’s structural.

  2. Same model or different models for second opinion? If the hospital can stomach two vendors, use two — model-level disagreement is more meaningful than prompt-level disagreement. If not, at minimum use a different prompt + different RAG corpus + different temperature, and call out that the independence is weaker.

  3. Memory model. Short-term working memory per encounter is mandatory. Long-term cross-encounter memory is a regulatory minefield: a feature like “the agent remembers this patient prefers conservative management” looks innocent and is a HIPAA violation waiting to happen. Keep memory encounter-scoped unless there’s an explicit clinical reason and an explicit data protection impact assessment (DPIA).

  4. RAG vs fine-tuning vs prompting for medical knowledge. RAG over UpToDate / guideline corpora plus pure prompting on the case is the sweet spot. Fine-tuning on hospital data is a long, expensive, regulatory-heavy path with marginal accuracy upside; defer it past v3, if ever.

  5. What’s the uncertainty representation? Free-text “I’m not sure” is unusable. Better: each differential gets a calibrated confidence (low / medium / high) plus what evidence is missing. Calibration matters — if “high” doesn’t mean ~90% accurate empirically, the doctors will stop trusting any signal.

  6. Eval harness. Three buckets:

    • Retrospective, on a frozen set of resolved cases — grade differential overlap with the discharge diagnosis.
    • Prospective shadow mode, where the agent runs alongside but never displays to physicians — collect agreement / disagreement rates, latency, cost.
    • Adversarial probe set of rare diagnoses and look-alike presentations, to test whether the agent fails gracefully (says “uncertain”) rather than being confidently wrong.

  7. Failure modes. What happens when the EHR API times out mid-diagnosis? When the knowledge base returns conflicting guidelines? When the second-opinion agent disagrees strongly? Each needs a designed path, not a hallucinated one.

  8. Human-in-the-loop pattern. This is not “the agent acts, the human approves.” It’s “the agent proposes, the human reasons, the agent supports the human’s reasoning.” The agent should never present a single answer with high confidence to a tired resident at 3 a.m. — it should present a list with reasoning and gaps. The framing matters.
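
Decision 5 asks that a "high" confidence label mean roughly 90% accuracy empirically. A minimal sketch of that check over graded retrospective cases follows. The data is invented for illustration, and the tolerance band is an assumption that the clinical team would need to set.

```python
from collections import defaultdict


def observed_accuracy(graded):
    """graded: iterable of (confidence_label, was_correct) pairs from resolved cases."""
    hits, totals = defaultdict(int), defaultdict(int)
    for label, was_correct in graded:
        totals[label] += 1
        hits[label] += int(was_correct)
    return {label: hits[label] / totals[label] for label in totals}


# Toy data, invented for illustration; not real evaluation results.
graded = (
    [("high", True)] * 9 + [("high", False)] * 1
    + [("medium", True)] * 6 + [("medium", False)] * 4
    + [("low", True)] * 2 + [("low", False)] * 8
)
accuracy = observed_accuracy(graded)

HIGH_TARGET = 0.90  # from the text: "high" should mean about 90% accurate
TOLERANCE = 0.05    # assumed acceptance band, not a clinical standard
high_is_calibrated = abs(accuracy["high"] - HIGH_TARGET) <= TOLERANCE
print(accuracy, high_is_calibrated)
```

The same table also exposes uniform hedging: if nearly every differential lands in "low", the labels carry no signal even when each bucket is individually well calibrated.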

Trade-offs to discuss

  • On-prem deployment. Strongest privacy story. PHI never leaves the hospital network. Costs: capital expenditure on GPU infrastructure, slower iteration, smaller model selection, harder to update.
  • BAA-covered hosted endpoint. Faster iteration, larger model choice, no infra to manage. Costs: contractual rather than physical privacy boundary, vendor lock-in, residual breach risk in transit.
  • One general medical agent. Single prompt, one reasoning chain. Simpler to evaluate end-to-end. Worse at specialty-specific edge cases.
  • Specialty-routed agents. Cardiology, neurology, pediatrics each get a tuned prompt and a tuned RAG corpus. Better at edge cases. Routing errors at the front door cascade.

Other axes to surface:

  • Confidence calibration vs hedging. Models trained for safety often hedge on everything — “this could be X, or Y, or Z” — which is useless. The eval harness must penalise both overconfident and uniformly-hedged outputs. Calibration is a first-class metric, not a vibes check.
  • Explanation depth. Long explanations are unread; short ones are unfalsifiable. Sweet spot is a one-line proposal + an expandable “supporting evidence / against / missing” panel.
  • Real-time vs polling. Push-based updates when new labs arrive feel modern but generate noise — every electrolyte panel triggers a refresh. Better: agent runs on a cadence (e.g. every 5 minutes) or on physician-triggered “refresh” — both are testable, neither is firehose-y.
  • Bias and fairness. Differential diagnosis quality varies by demographic group in published models. The eval harness must stratify accuracy by age, sex, ethnicity, and language at minimum. Aggregate accuracy is a lie if the subgroup numbers diverge.
  • Cost tiering. Triage on the cheapest model, diagnosis on a capable one, second-opinion on the largest. Don’t pay GPT-class rates for “is this person sick enough to skip the waiting room.”
  • Documentation drafting. Tempting to let the agent draft the entire note. Safer: draft only the assessment-and-plan, leave HPI / ROS / physical exam to the physician. The drafted section gets a visible “AI-assisted” badge in the EHR.
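
The bias-and-fairness point above says aggregate accuracy can hide subgroup divergence. A small sketch of the stratified view follows, using invented toy cases and a hypothetical `language` field; a real harness would stratify by age, sex, ethnicity, and language as listed.

```python
from collections import defaultdict


def accuracy_by_group(cases, key):
    """cases: list of dicts with a boolean 'correct' field plus demographic fields."""
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        totals[case[key]] += 1
        hits[case[key]] += int(case["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def max_gap(per_group):
    """Largest accuracy difference between any two subgroups."""
    return max(per_group.values()) - min(per_group.values())


# Toy cases, invented for illustration; not real evaluation results.
cases = (
    [{"language": "en", "correct": True}] * 9
    + [{"language": "en", "correct": False}] * 1
    + [{"language": "es", "correct": True}] * 3
    + [{"language": "es", "correct": False}] * 2
)
per_language = accuracy_by_group(cases, "language")
overall = sum(case["correct"] for case in cases) / len(cases)
print(overall, per_language, round(max_gap(per_language), 2))
```

In this toy set the overall figure is 0.8 while the two subgroups sit at 0.9 and 0.6, which is the pattern an aggregate-only report would miss.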

Evaluation criteria

A passing answer:

  • Has a clear architecture — supervisor + sub-agents with distinct roles, tool surface explicit, data flow drawn.
  • Names the tools: get_recent_labs, search_knowledge_base, compute_score. Not “the agent looks at the patient.”
  • Respects the no-autonomous-actions constraint — every commit goes through a physician click. Drafts only.
  • Handles uncertainty as a first-class output — calibrated confidence levels, missing-evidence callouts, graceful unknown.
  • Has an evaluation strategy — retrospective + shadow + adversarial, with subgroup stratification.
  • Names the privacy boundary — on-prem or BAA-covered, audit log, encounter-scoped memory.

A strong answer adds:

  • A phased rollout — v1 read-only triage card, v2 differential support in shadow mode, v3 second-opinion check, v4 documentation drafting. Each phase has its own eval gate.
  • An explicit failure mode catalogue: what if the model hallucinates a lab value? Cites a guideline that doesn’t exist? Disagrees with itself across sub-agents? Each scenario has a designed response.
  • Bias and fairness as a launch gate, not a post-launch metric. If subgroup accuracy diverges past a threshold, the system doesn’t ship.
  • An incident-response plan. When (not if) the system makes a clinically meaningful error that contributed to a near-miss, what’s the disclosure path? Who pulls the on-call rope?
  • Awareness that this design is socio-technical. The hardest part isn’t the orchestration; it’s the workflow integration, the physician trust calibration, the medico-legal liability allocation.
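
For the "hallucinated lab value" entry in the failure-mode catalogue, one designed response is to verify every cited value against the EHR before anything is displayed. The sketch below is illustrative only: the `Citation` shape, the row identifiers, and the field names are hypothetical, and the EHR snapshot is fake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    row_id: str      # hypothetical EHR row identifier
    field_name: str  # e.g. "troponin_ng_l"
    value: float     # the value the model claims to have read


def verify_citations(citations, ehr_rows):
    """Return citations that do not match the EHR; an empty list means all checked out."""
    failures = []
    for citation in citations:
        row = ehr_rows.get(citation.row_id)
        if row is None or row.get(citation.field_name) != citation.value:
            failures.append(citation)
    return failures


# Fake EHR snapshot for illustration only.
ehr_rows = {"lab-123": {"troponin_ng_l": 52.0}}
claimed = [
    Citation("lab-123", "troponin_ng_l", 52.0),   # matches the EHR
    Citation("lab-999", "d_dimer_ng_ml", 850.0),  # row does not exist
]
failures = verify_citations(claimed, ehr_rows)
# Designed response: withhold the proposal and surface "could not verify evidence".
show_proposal = not failures
print(show_proposal, [c.row_id for c in failures])
```

This follows from the constraint that the EHR is the source of truth: a proposal whose evidence cannot be traced to a row is treated as unverified rather than shown with a warning.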

The single design choice interviewers keep returning to

The second-opinion agent’s independence. A weak design uses the same model with a slightly different prompt and labels it “second opinion” — that’s marketing, not safety. A strong design uses a different base model where possible, a different knowledge-base corpus, an explicitly different reasoning style (e.g., one that starts from the differential and rules in, one that starts from the working diagnosis and tries to rule out), and measures the conditional independence of their errors. If the two agents fail on the same cases, you have one agent with two faces. If they fail on different cases, you have a meaningful safety check. The point of the second opinion is decorrelated errors — and “decorrelated” is an empirical claim you have to verify, not a design claim you get to assert.
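
Whether the two agents' errors are decorrelated can be estimated from graded cases. The sketch below compares the observed rate of joint failure with the rate expected under independence, and reports the phi coefficient (the Pearson correlation for two binary variables). The function name and the toy error vectors are assumptions made for illustration.

```python
from math import sqrt


def error_overlap(primary_wrong, second_wrong):
    """Both inputs are equal-length lists of booleans, one entry per graded case."""
    n = len(primary_wrong)
    p_a = sum(primary_wrong) / n
    p_b = sum(second_wrong) / n
    p_both = sum(a and b for a, b in zip(primary_wrong, second_wrong)) / n
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    phi = (p_both - p_a * p_b) / denom if denom else 0.0
    return {"p_both": p_both, "p_if_independent": p_a * p_b, "phi": phi}


# Toy error vectors over 20 cases, invented for illustration.
primary = [True] * 4 + [False] * 16
same_cases = list(primary)                          # second agent fails on the same 4 cases
other_cases = [False] * 4 + [True] * 4 + [False] * 12  # fails on 4 different cases

two_faces = error_overlap(primary, same_cases)
decorrelated = error_overlap(primary, other_cases)
print(round(two_faces["phi"], 2), round(decorrelated["phi"], 2))
```

A phi near 1 is the "one agent with two faces" case; a phi near or below 0 suggests the second opinion catches errors the first agent makes. With small evaluation sets the estimate is noisy, so it is worth reporting alongside the case count.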
