Building my first real multi-agent loop

The euphoria of a single agent successfully completing a task lasts about ten minutes. My first one parsed a messy log file and summarized the critical errors. It felt like magic. So, naturally, I decided to chain another agent to it. That’s when the magic vanished and the plumbing burst.

While many agent frameworks offer powerful abstractions, I quickly found they often added unpredictability where I needed rock-solid control. The path to a working system involved stripping things down, not building them up.

The Illusion of Collaboration

My initial design was simple, almost naive. Agent One, the "Analyst," would read the log. Agent Two, the "Writer," would take the Analyst's output and compose the summary. I imagined them passing a work ticket between specialists. In reality, it was two black boxes attempting to communicate through a keyhole.

The Analyst produced a raw JSON dump. The Writer agent, expecting clean prose, produced a garbled mess. It had no access to the Analyst's "thought process" or the original request's intent. Fixing this felt like a game of prompt engineering whack-a-mole. I wasn't orchestrating collaboration; I was mediating a frustrating, token-burning argument between two models.

[Figure: Naive vs. Orchestrated Agent Flow]

The State Is the Contract

The breakthrough came when I stopped thinking about agents talking to each other. Instead, I had them read from and write to a central, structured "job sheet." This shared state object became the single source of truth for the entire multi-step task.

It's a living document: a JSON object containing the original prompt, the current step, a log of actions, a dictionary of outputs, and a `scratchpad`. The `scratchpad` is critical. It mirrors the "thought" process from foundational patterns like the ReAct paper, giving each agent a record of what the previous steps were thinking and why. Now, the Writer agent receives the entire state. Its prompt can reference specific artifacts and the original goal, providing the context that direct agent-to-agent communication could never reliably achieve.
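A minimal sketch of such a job sheet, using only the standard library. The field names (`original_prompt`, `current_step`, `action_log`, `outputs`), the step names, and the `build_writer_prompt` helper are illustrative assumptions, not a fixed schema; only `scratchpad` is named in the text above.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class JobState:
    """Shared job sheet that every step reads from and writes to."""
    original_prompt: str
    current_step: str = "analyze"                          # assumed step name
    action_log: list[str] = field(default_factory=list)    # what happened, in order
    outputs: dict[str, Any] = field(default_factory=dict)  # artifacts keyed by step
    scratchpad: list[str] = field(default_factory=list)    # reasoning notes per step


def build_writer_prompt(state: JobState) -> str:
    """Give the Writer the whole state, not just the previous agent's output."""
    findings = json.dumps(state.outputs.get("analysis"), indent=2)
    return (
        "Original request:\n" + state.original_prompt + "\n\n"
        "Analyst notes:\n" + "\n".join(state.scratchpad) + "\n\n"
        "Analyst findings (JSON):\n" + findings + "\n\n"
        "Write a plain-prose summary of the critical errors."
    )


# Simulate what the state looks like after the Analyst step has run.
state = JobState(original_prompt="Summarize the critical errors in app.log")
state.outputs["analysis"] = {"critical_errors": ["DB timeout at 02:14"]}
state.scratchpad.append("Analyst: ignored WARN lines; only ERROR and FATAL matter.")
state.action_log.append("analyze: ok")
state.current_step = "write"

prompt = build_writer_prompt(state)
print(json.dumps(asdict(state), indent=2))  # the state serializes to plain JSON
```

Because the state is plain data, it can be logged, persisted between steps, or replayed when debugging a failed job.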

The Orchestrator Is Not an Agent

My next mistake was trying to make the "manager" a third, smarter LLM. This "Orchestrator" agent was a disaster. It would get stuck in loops, re-running the Analyst agent over and over. I was asking a creative, probabilistic tool to do a job that required absolute, boring predictability.

The solution was to replace it with a deterministic state machine: a plain Python function. Some frameworks, like Microsoft's AutoGen, orchestrate complex tasks through conversations between agents, but for a predictable production workflow I found that freedom introduced too much non-determinism. The state machine's logic is dead simple: read the current step from the state, call the right tool, update the state, set the next step, and loop.
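The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the exact production code: `run_analyst` and `run_writer` are hypothetical stand-ins for the LLM-backed agents, and the step names and `max_iterations` guard are invented for the example.

```python
from typing import Callable


# Hypothetical stand-ins for the agents; real ones would call a model.
def run_analyst(state: dict) -> None:
    state["outputs"]["analysis"] = {"critical_errors": ["DB timeout at 02:14"]}


def run_writer(state: dict) -> None:
    errors = state["outputs"]["analysis"]["critical_errors"]
    state["outputs"]["summary"] = f"{len(errors)} critical error(s): " + "; ".join(errors)


# Step name -> (handler, next step). The transition table is plain data, not a prompt.
STEPS: dict[str, tuple[Callable[[dict], None], str]] = {
    "analyze": (run_analyst, "write"),
    "write": (run_writer, "done"),
}


def orchestrate(state: dict, max_iterations: int = 20) -> dict:
    """Read the current step, call its handler, record it, advance, repeat."""
    for _ in range(max_iterations):
        step = state["current_step"]
        if step in ("done", "failed"):
            return state
        handler, next_step = STEPS[step]
        handler(state)
        state["log"].append(f"{step}: ok")
        state["current_step"] = next_step
    state["current_step"] = "failed"  # hard stop if the table ever cycles
    return state


state = {
    "prompt": "Summarize the critical errors in app.log",
    "current_step": "analyze",
    "log": [],
    "outputs": {},
}
result = orchestrate(state)
print(result["current_step"], result["log"])
```

The orchestrator never asks a model what to do next. Every transition is visible in `STEPS`, so the same input state always takes the same path.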

Designing for Failure and Correction

What happens when an agent produces bad work? In a loop, this risk is magnified. My system needed an immune response. I introduced a deterministic "Validator" step after any agent that produced structured output, for example a Pydantic model that parses the JSON. If parsing fails, validation fails. No ambiguity.
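To show the shape of such a validator without pulling in a dependency, this sketch swaps the Pydantic model for a standard-library dataclass with hand-written checks. The `AnalystReport` schema and its `critical_errors` field are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalystReport:
    """Assumed shape of the Analyst's structured output."""
    critical_errors: list[str]


def validate_analyst_output(raw: str) -> tuple[Optional[AnalystReport], Optional[str]]:
    """Return (report, None) on success or (None, error message) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value must be a JSON object"
    errors = data.get("critical_errors")
    if not isinstance(errors, list) or not all(isinstance(e, str) for e in errors):
        return None, "'critical_errors' must be a list of strings"
    return AnalystReport(critical_errors=errors), None


good, good_err = validate_analyst_output('{"critical_errors": ["DB timeout at 02:14"]}')
bad, bad_err = validate_analyst_output("Sure! Here is the JSON you asked for:")
print(good, good_err)
print(bad, bad_err)
```

The validator returns the error as a string rather than raising, so the message can be written into the state and shown to the agent on the next attempt.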

When validation fails, the orchestrator appends an error to the state log, increments a retry counter, and sends the task *back* to the same agent with the error message included in the context. "Your last output failed validation. Please correct it." This creates a self-correction loop with an escape hatch. If the retry counter exceeds a threshold, the orchestrator moves the job to a failed state and stops. It prevents infinite, expensive loops.
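A sketch of that retry loop with its escape hatch. The `MAX_RETRIES` value, the state keys, and both demo agents are assumptions for illustration. The validator here is a deliberately simple JSON check so the block stays self-contained.

```python
import json
from typing import Callable, Optional

MAX_RETRIES = 3  # assumed threshold


def check_json_object(raw: str) -> Optional[str]:
    """Deterministic validator: return None if OK, otherwise an error message."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    return None if isinstance(parsed, dict) else "expected a JSON object"


def run_with_retries(agent: Callable[[dict], str], state: dict) -> dict:
    """Call the agent, validate, and feed errors back until success or the cap."""
    while True:
        raw = agent(state)
        error = check_json_object(raw)
        if error is None:
            state["outputs"]["analysis"] = json.loads(raw)
            state["status"] = "ok"
            return state
        state["log"].append(f"validation failed: {error}")
        state["retries"] += 1
        if state["retries"] > MAX_RETRIES:
            state["status"] = "failed"  # escape hatch: stop and escalate
            return state
        # The same agent gets another turn, with the error in its context.
        state["feedback"] = f"Your last output failed validation: {error}. Please correct it."


def flaky_agent(state: dict) -> str:
    """Hypothetical agent: fails once, then succeeds once it sees feedback."""
    return '{"critical_errors": []}' if "feedback" in state else "oops, not JSON"


def broken_agent(state: dict) -> str:
    """Hypothetical agent that never produces valid output."""
    return "still not JSON"


def new_state() -> dict:
    return {"log": [], "retries": 0, "outputs": {}}


recovered = run_with_retries(flaky_agent, new_state())
gave_up = run_with_retries(broken_agent, new_state())
print(recovered["status"], recovered["retries"])
print(gave_up["status"], gave_up["retries"])
```

The cap bounds the worst case: with `MAX_RETRIES = 3`, a permanently broken agent is called at most four times before the job is marked failed.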

[Figure: Production Multi-Agent Architecture]

My Practical Takeaways for Production

Building this system stripped away the hype and revealed the engineering discipline required. It’s a microcosm of the convergence of software, data, and AI. The flashy part is the LLM, but the architecture is what holds up at 3am.

Govern with a dumb robot. Your orchestrator should be the most predictable part of your system. Use a state machine, not a master LLM, to control the workflow.
The state is the contract. A centralized, structured state object is the only reliable way to manage context and communication between agents. Make it the single source of truth.
Plan for failure, not just success. Embed deterministic validation and finite retry loops. Unchecked loops are budget black holes. Know when to let the system give up and escalate.
Let agents be fuzzy, keep flow rigid. Use LLM agents for what they do best: generating, summarizing, and transforming unstructured data. Use deterministic code for everything else.

Moving from one agent to many doesn't add complexity; it multiplies it. Taming it requires leaning on the oldest patterns in our toolkit: clear state management and deterministic control flow.