Most questions don't need the swarm: what medicine's agentic AI taught me

In short

A NeurIPS 2024 medical-AI paper, MDAgents, has one disciplined idea worth stealing. Before it answers, it classifies how hard the question is, then assigns the collaboration structure to match: a single agent for the routine, a multidisciplinary team for the moderate, an integrated team-of-teams for the genuinely hard. It wins on 7 of 10 benchmarks while keeping the cheap path cheap. The accuracy comes from the collaboration. The part you can actually deploy is the escalation policy that decides when not to convene the swarm. And the soft spot, the thing to watch, is the model that makes that call.

Walk into a clinic with a headache and no one convenes a tumor board. A nurse handles it in ninety seconds. Walk in with an ambiguous mass on a scan and the same building does something completely different: it pulls a radiologist, an oncologist, maybe a pathologist into a room, lets them disagree, and converges on a call. Same hospital, wildly different amounts of machinery, and the amount is chosen before anyone treats the case, based on how hard the case looks.

That triage instinct is so obvious in medicine it goes unnoticed. It is not obvious in how we build AI agents. The default right now is the opposite: throw a committee of language models at everything, let them "debate," and hope the averaging helps. MDAgents, "An Adaptive Collaboration of LLMs for Medical Decision-Making," out of MIT, is the first paper I have read that treats that instinct as the actual architecture and then measures what it buys.

The one idea

Most multi-agent papers bolt the same heavyweight committee onto every question. MDAgents inserts a step before the committee exists. First it runs a complexity check on the medical query and labels it low, moderate, or high. Only then does it pick the collaboration structure, and the paper gives each tier a clinical name:

The whole paper in one move: assess difficulty first, then assign the collaboration structure to match it. The names are clinical on purpose.

The moderate tier, the multidisciplinary team, is the one most people picture when they hear "multi-agent": role-playing specialist agents are recruited, they analyze the case in rounds, and a moderator drives them toward a single consensus answer. The high tier, the integrated care team, goes a step further: several teams each produce a report, then a final review stage reconciles them, and external medical knowledge is pulled in before the decision lands. Each rung costs more, and the system only climbs when the complexity check says the case warrants it.

Inside a hard case: who actually runs the loop

The detail an architect should notice is the orchestration. A consult is not a model free-associating until it feels finished. A moderator runs a bounded sequence: recruit the roles, let them reason, surface and reconcile disagreement, synthesize one answer, stop. The judgment lives inside the agents. The control flow around them is fixed, readable code.

The team runs on rails. The agents bring judgment. The orchestrator brings the loop, the budget, and the stop condition.

What it measured, and the wrinkle worth reading for

MDAgents is tested across ten medical benchmarks: MedQA in the USMLE style, MedMCQA, PubMedQA, the medical slices of MMLU, and several multimodal sets. It reports best performance on 7 of 10, up to a 4.2% gain over the prior best method, and a larger jump of about 11.8% on average when the moderator's review is paired with external medical knowledge. The efficiency case is explicit too: the solo tier is enough for easy cases, so you do not pay committee prices on questions that do not need them.

Now the part a real review owes you. Read the method section and one number gets quietly load-bearing: the claim that the complexity check picks the right tier "with probability at least 80%." Follow it to the source and that confidence rests on a small probe, twenty-five MedQA questions, each run ten times, with the headline coefficient reported as 0.81 plus or minus 0.29. A standard deviation of 0.29 on the number that justifies the entire router is a soft empirical floor. It is not a flaw that sinks the paper. It is exactly the spot an architect should put a finger on, because everything good about this design assumes the router is usually right, and "usually right, measured on twenty-five questions" is a thinner foundation than the framework's elegance suggests.

Why this beats the committee papers before it

The multi-agent wave that came first already showed that collaboration can beat a lone model on hard reasoning. MedAgents, from 2024, is the clean version of that story. But nearly all of them share one assumption: the structure is fixed. Same number of agents, same debate, every time. Put them side by side and MDAgents' contribution gets sharp.

Approach	Structure	On an easy question	On a hard question	What it optimizes
Single LLM	one model	cheap, fine	brittle, overconfident	latency and cost
Fixed multi-agent (e.g. MedAgents)	same committee always	expensive overkill	strong	accuracy on hard cases
MDAgents	chosen per query	stays solo	convenes a team	the right amount of agency

The new idea is not a better debate. It is the policy sitting above the debate, a rule that picks how much collaboration to spend and is willing to pick "one model, no debate." That willingness is the contribution.

The part the hype gets backwards

Everyone copies the swarm. The hard part is the policy that decides not to summon it.

Be precise here, because this is easy to overstate. The accuracy in this paper comes from the collaboration. The biggest measured win, that 11.8% jump, is a team plus grounding result, not a routing result. Multi-agent deliberation genuinely works on hard problems, and I am not going to pretend otherwise. Agents are not theater.

What the field keeps celebrating, though, is the wrong half. The exciting half is "specialist agents debate like a tumor board." The half that makes this deployable in a setting where a wrong answer has a body attached is the quiet classifier that, on most inputs, looks at the case and routes it to one agent. Every demo wants to show you the swarm. The discipline is in the cases where it never appears.

Because in production the committee is not free, and this paper hands you the receipt. The solo tier runs in about 14.7 seconds. The moderate team takes about 95.5. The high tier runs around 226 seconds, roughly fifteen times the solo cost, before you count the extra tokens and the extra surface area for confident error to compound. Three agents agreeing can be three agents wrong together. So "more agents" is a line item, not a strategy. The strategy is knowing which questions are worth it.

That is the split worth naming cleanly. The collaboration buys the accuracy. The routing buys the deployability: the bill stays small, the escalation is logged, and the cheap path is the default. MDAgents' lasting engineering lesson is not that committees win. It is that a system can learn when to keep its hands in its pockets, and that restraint is what lets you put the committee anywhere near a real workload.

Why an architect should care (this is not a medical post)

I do not review a medical-AI paper because I am building a doctor. I review it because medicine is the most unforgiving production environment there is, a place where a wrong output has a cost you cannot refund, and the patterns that survive there are the ones I want around my agents everywhere. Strip the white coats off MDAgents and you are left with a short checklist for any agentic system you would put behind an SLA.

Route before you reason. Score the task's difficulty and spend compute to match it. The cheapest correct path is the default. Escalation is a deliberate, logged decision, not a reflex.
The orchestrator owns the loop, never the model. Recruit, deliberate, ground, synthesize, stop. All of that is control flow you can read, test, and bound. The model is the resident. Your code is the attending who calls the end of rounds.
Grounding has to be load-bearing. The biggest measured jump came from pairing moderation with external knowledge. Retrieval the decision actually depends on, not a citation stapled on afterward, is where the points live.
Build the "do not act" path and measure it. The most production-relevant behavior in the whole paper is the system choosing the small, cheap answer, or by extension choosing to abstain. And the metric that matters is not "did the committee win." It is how often you correctly avoided the committee, and what that saved.

The humbling part is that we already run this by hand. Every time I put a hard architecture call in front of a panel of five frontier models and reconcile them to a consensus, that is the high tier. Every time I let one model answer a small thing, that is the router staying solo. MDAgents wrote down and measured the policy I had been running on instinct. The honest next move is to make the router explicit, and then to remember that the router itself is a model I have to keep honest.

Our open questions

Who guards the router? The whole design assumes the complexity check is right, and the paper only validates that on twenty-five questions, at 0.81 plus or minus 0.29. A miscalibrated router under-escalates the dangerous case or wastes a team on a trivial one. In medicine that asymmetry is life and death, so what does a deliberately conservative router look like, and how do you price a wrong triage?
Does consensus launder error or expose it? A moderator driving agents to agreement is comforting, but agreement is not correctness. When three specialist personas converge confidently on the wrong answer, is that more or less likely than a single model getting it wrong alone?
Who signs the chart? In a real tumor board a named attending is accountable for the call. If our router quietly stays solo on a case that needed a team, and it is wrong, whose name is on the log? The routing decision, not just the answer, needs an owner.

Read the original paper

Worth the time. The routing diagrams and the complexity taxonomy are the heart of it, and the ablations are where the real story lives. Read it here without leaving:

PDF

MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making

Kim, Park, Jeong, Chan, Xu, McDuff, Lee, Ghassemi, Breazeal, Park · MIT · arXiv:2404.15155 · NeurIPS 2024

⤢ Expand Open paper ↗

✕ Close

Embedded from arXiv (open access). Tap ⤢ Expand to read it fullscreen right here, or “Open paper ↗” for a new tab.

MIT

Yubin Kim, Hae Won Park, Cynthia Breazeal, Marzyeh Ghassemi & colleagues

MIT Media Lab (Personal Robots) and MIT CSAIL (Healthy ML), with collaborators at Google and Seoul National University Hospital

The team behind the paper

This is not a pure-NLP group. It comes out of the intersection of human-robot interaction (Breazeal's Personal Robots lab) and clinically grounded machine learning (Ghassemi's Healthy ML). That pairing is why the framing is about collaboration structure and trust, not just leaderboard points. The people who study how humans and machines should defer to each other wrote a paper about when an AI should call for help.

Cynthia Breazeal pioneer of social robotics (Kismet), MIT Media Lab. The "agents as collaborators" lens starts here.
Marzyeh Ghassemi MIT CSAIL, Healthy ML. Rigor about machine learning that touches real patients, including when models should not be trusted.
Yubin Kim & Hae Won Park lead authorship from the Media Lab. The adaptive-collaboration framework is the contribution.
Apr 2024 posted to arXiv (2404.15155).
NeurIPS 2024 presented at the field's top venue, the credibility line worth keeping.