Lessons from building a multi-agent system over millions of records

The demo worked perfectly. An "analyst" agent scanned a few hundred records, a "strategist" agent identified a pattern, and a "writer" agent drafted a summary. Then came the real task: run this same logic over tens of millions of records, with failures, retries, and a hard deadline. The demo's architecture fell apart in the first hour.

This is the gap I see everywhere right now. The leap from a proof-of-concept to a resilient production system is massive. When you add the non-determinism of LLMs, the problem multiplies. After building and rebuilding such a system, I’ve learned the winning patterns have less to do with prompt engineering and more to do with boring, durable data architecture.

Agents as Stateless Workers

The first mental model to discard is that of a "team" of autonomous agents chatting amongst themselves. At scale, this is a recipe for chaos, runaway costs, and untraceable errors. The agent isn't the system; it's a single, specialized component within the system.

This is a classic Worker Pattern, common in distributed systems for decades. In this design, the LLM-powered agent is just one step in a much larger, more predictable pipeline. The system around it is responsible for the hard parts: fetching data, managing state, routing tasks, and handling failures. We treated each agent as a stateless, idempotent worker that pulls a job from a message queue. The job contained the data, the prompt, and the unique ID for that unit of work. The agent did its one thing—summarize, classify, extract—then wrote its structured output back. Collaboration happened asynchronously, orchestrated by the flow of data, not by a conversation.
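As a rough illustration of that worker loop, here is a minimal sketch. It assumes an in-process `queue.Queue` in place of a real message broker and a plain dict in place of the output store. `Job`, `fake_agent`, and `run_worker` are hypothetical names, and `fake_agent` only stands in for the LLM call. Idempotency here means a redelivered job ID does not run the agent a second time.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str  # unique ID for this unit of work
    prompt: str
    data: str


def fake_agent(prompt: str, data: str) -> dict:
    """Hypothetical stand-in for the LLM call; returns structured output."""
    return {"summary": data.upper()}


def run_worker(jobs: queue.Queue, results: dict, agent=fake_agent) -> int:
    """Drain the queue and return how many jobs actually ran the agent."""
    ran = 0
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return ran
        if job.job_id not in results:  # a redelivered job is skipped
            results[job.job_id] = agent(job.prompt, job.data)
            ran += 1
        jobs.task_done()


jobs = queue.Queue()
for job in (
    Job("rec-1", "Summarize:", "first record"),
    Job("rec-2", "Summarize:", "second record"),
    Job("rec-1", "Summarize:", "first record"),  # duplicate delivery
):
    jobs.put(job)

results = {}
ran = run_worker(jobs, results)
print(ran, sorted(results))
```

The worker holds nothing between jobs: everything it needs arrives in the job, and everything it produces goes to the store keyed by job ID.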

[Figure: Simple Agentic Worker Flow]

Externalize State, Tame Chaos

LLMs are fundamentally stateless. A context window is not a database. Relying on it to maintain state across millions of records is like using a sticky note to manage a warehouse inventory. It fails immediately and catastrophically.

When an agent processes record #1,500,000, how does it know what happened with the first million? The answer is that the state must live entirely outside the agent. For every record, we implemented a transactional state machine in a simple, robust database table. The lifecycle was explicit: `PENDING` → `PROCESSING` → `COMPLETED` / `FAILED`. This is one of the foundational ideas in reliable systems, well documented in resources like Gregor Hohpe and Bobby Woolf's Enterprise Integration Patterns. It's an old solution, but it's the right one.

Before an agent worker picked up a job, it locked the corresponding row so that no other worker could claim the same record. If the agent succeeded, the worker wrote the output and updated the state to `COMPLETED`. If the job still failed after three retries, the row was marked `FAILED`, and the inputs and error were logged for manual review. This turns the unpredictable agent into a transactional step, making the entire process resilient to failure and transparent to operators.

Orchestration: Deterministic DAGs over Agentic Chat

The idea of a "router" agent directing traffic to other agents is compelling in demos. Frameworks emerging from research, like Microsoft's AutoGen, have popularized multi-agent conversation as a powerful paradigm. For creative or exploratory tasks, it's promising. But for high-volume, repetitive data processing, we found this approach introduces a dangerous source of non-determinism.

Instead, we relied on a rigid, deterministic orchestrator—a piece of regular code, not another LLM. The "collaboration" was just data flowing through a directed acyclic graph (DAG). This is the same principle behind data workflow tools like Apache Airflow. The output of one agent becomes the input for the next, but the routing is handled by deterministic code based on structured data. The agents don't need to "talk" to each other; they just need to agree on the input and output schemas for each task. This makes the system testable, predictable, and far easier to debug at 3am.
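To make the idea concrete, here is a small sketch of a deterministic orchestrator using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The three step functions are hypothetical stand-ins for LLM-backed agents, named after the roles in the demo, and the dict-based schema check is a deliberately simple assumption in place of a real validation library.

```python
from graphlib import TopologicalSorter


# Hypothetical steps; in a real system each would wrap an LLM call.
def analyst(inputs: dict) -> dict:
    amount = inputs["source"]["amount"]
    return {"amount": amount, "is_large": amount > 1000}


def strategist(inputs: dict) -> dict:
    return {"pattern": "high-value" if inputs["analyst"]["is_large"] else "routine"}


def writer(inputs: dict) -> dict:
    return {"summary": f"Record flagged as {inputs['strategist']['pattern']}."}


STEPS = {"analyst": analyst, "strategist": strategist, "writer": writer}

# Each step lists the nodes it depends on; "source" is the raw record.
DEPENDS_ON = {
    "analyst": {"source"},
    "strategist": {"analyst"},
    "writer": {"strategist"},
}

# The contract between steps: exact field names and types of each output.
OUTPUT_SCHEMA = {
    "analyst": {"amount": int, "is_large": bool},
    "strategist": {"pattern": str},
    "writer": {"summary": str},
}


def validate(step: str, output: dict) -> None:
    """Reject any output that does not match the step's declared schema."""
    schema = OUTPUT_SCHEMA[step]
    if set(output) != set(schema):
        raise ValueError(f"{step}: expected fields {sorted(schema)}")
    for field, expected_type in schema.items():
        if not isinstance(output[field], expected_type):
            raise ValueError(f"{step}.{field}: expected {expected_type.__name__}")


def run_pipeline(record: dict) -> dict:
    """Run every step in dependency order; routing is plain code, not an LLM."""
    outputs = {"source": record}
    for node in TopologicalSorter(DEPENDS_ON).static_order():
        if node not in STEPS:  # "source" is an input, not a step
            continue
        inputs = {dep: outputs[dep] for dep in DEPENDS_ON[node]}
        result = STEPS[node](inputs)
        validate(node, result)
        outputs[node] = result
    return outputs


print(run_pipeline({"amount": 2500})["writer"]["summary"])
```

Because the order and the contracts are fixed in code, each step can be unit-tested with canned inputs, and a schema violation fails loudly at the step that caused it instead of propagating downstream.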

Auditability is Non-Negotiable

When an agent produces an incorrect output for record #4,582,109, you have a serious problem. The non-determinism of LLMs means you can't just re-run it and expect to see the same bug. Without a perfect record of what happened, debugging is impossible.

For every single LLM call, we logged everything transactionally:

The unique job ID.
The exact, complete prompt sent to the model.
The model name and API parameters (temperature, etc.).
The raw, unparsed response from the API.
The parsed, structured output after our validation layer.
The token counts and calculated cost.
A timestamp and the worker ID.

This level of logging is the bedrock of modern LLM-ops and observability. It feels like overkill at first, but it becomes the system's ground truth. It turns a magical, un-debuggable problem into a data analysis problem. When a bad result is found, we can instantly retrieve the entire chain of events and isolate the failure point.
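One way to capture the fields listed above is a single append-only table with one row per LLM call. The sketch below uses SQLite for brevity. The column names, the `log_call` and `history` helpers, the model name, and the per-token prices are all hypothetical, and real prices depend on the provider and model.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """CREATE TABLE llm_calls (
    call_id           INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id            TEXT NOT NULL,
    worker_id         TEXT NOT NULL,
    created_at        TEXT NOT NULL,
    model             TEXT NOT NULL,
    params_json       TEXT NOT NULL,
    prompt            TEXT NOT NULL,
    raw_response      TEXT NOT NULL,
    parsed_json       TEXT,
    prompt_tokens     INTEGER NOT NULL,
    completion_tokens INTEGER NOT NULL,
    cost_usd          REAL NOT NULL)"""

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {"prompt": 0.001, "completion": 0.002}


def log_call(conn, *, job_id, worker_id, model, params, prompt,
             raw_response, parsed, prompt_tokens, completion_tokens):
    """Insert one audit row in its own transaction and return the cost."""
    cost = (prompt_tokens * PRICE_PER_1K["prompt"]
            + completion_tokens * PRICE_PER_1K["completion"]) / 1000
    with conn:  # commit the whole row or nothing
        conn.execute(
            "INSERT INTO llm_calls (job_id, worker_id, created_at, model, "
            "params_json, prompt, raw_response, parsed_json, prompt_tokens, "
            "completion_tokens, cost_usd) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
            (job_id, worker_id, datetime.now(timezone.utc).isoformat(), model,
             json.dumps(params, sort_keys=True), prompt, raw_response,
             json.dumps(parsed) if parsed is not None else None,
             prompt_tokens, completion_tokens, cost),
        )
    return cost


def history(conn, job_id):
    """Return every logged call for a job, oldest first."""
    cur = conn.execute(
        "SELECT * FROM llm_calls WHERE job_id = ? ORDER BY call_id", (job_id,)
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.execute(SCHEMA)

raw = '{"label": "refund"}'
cost = log_call(
    conn,
    job_id="rec-4582109",
    worker_id="worker-07",
    model="example-model",
    params={"temperature": 0.0},
    prompt="Classify this record: ...",
    raw_response=raw,
    parsed=json.loads(raw),
    prompt_tokens=1200,
    completion_tokens=300,
)
rows = history(conn, "rec-4582109")
print(len(rows), rows[0]["model"], rows[0]["raw_response"])
```

Storing the raw response alongside the parsed output matters: when the two disagree, the bug is in the validation layer, and when the raw response itself is wrong, the stored prompt and parameters show exactly what the model was asked.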

[Figure: Production Multi-Agent Architecture]

Durable Principles for Agentic Systems

The principles that make these systems work are the same ones that have governed distributed data processing for decades. The agentic component is new and powerful, but it must be tamed and integrated with discipline. My key takeaways are simple and, frankly, a bit boring.

Treat Agents as Idempotent Workers. The agent is a tool, not the architect. The architecture is queues, databases, and simple orchestrators.
Manage State Transactionally, Outside the Agent. Never trust an agent's context window to manage the state of a long-running process. Use a real database.
Orchestrate with Deterministic DAGs, Not Conversations. Agent "coordination" at scale is just message-passing with strong contracts. Schemas are those contracts.
Build an Immutable Audit Log for True Observability. Your audit log is your most important debugging tool. Without it, you are flying blind.

The future of these systems is bright, but it will be built on the foundations of solid software and data engineering, not on the hope of autonomous magic.