What the data years taught me that I still use in agentic systems

I once watched a multi-step agent task fail on step seven of nine. It was supposed to analyze a repository, suggest a fix, open a pull request, and tag a reviewer. A GitHub API call returned a transient 503 error. The agent retried, but its state became confused. It lost the context of the original goal and started a new, unrelated task based on the error message. This isn't a problem with the LLM; it's a failure of architecture.

The hype around agentic systems focuses on the magic of the reasoning loop. In production, however, the system's reliability is defined by its most fragile connection: the tool call. The failure modes look exactly like the challenges we faced building the first generation of Hadoop-based ETL jobs. The solutions we developed then are precisely what these new agentic systems need.

Agents Are Brittle Distributed Systems

An LLM agent is an orchestrator. It breaks a large goal into a sequence of API calls. This makes any agent a distributed system, subject to network latency, transient errors, and unreliable dependencies. The agent's core asset is its state: the memory of what it has done and what it plans to do next. When a tool call fails, the common approach is to feed the raw error back into the context.

Frameworks have started offering ways to handle this. The LangChain documentation on handling tool errors, for instance, shows how to return an error message to the model. This is a necessary first step, but it’s not a production strategy. A cryptic 502 Bad Gateway pollutes the agent's memory, consumes expensive context tokens, and can lead it down a non-recoverable path. It asks the LLM to solve a classic network engineering problem, which is the wrong job for it.

[Figure: The naive agentic tool call]

This is the same wall we hit with early data pipelines. A single malformed record could halt a job that had run for hours, forcing a manual restart. We learned the hard way: the control flow must be insulated from the imperfections of the network and its endpoints.

Idempotency: Your First Line of Defense

The first principle from the data world is idempotency. An operation is idempotent if running it multiple times has the same effect as running it once. When an API call times out, you don't know if the request was processed. Simply retrying is dangerous—you might create two pull requests or charge a customer twice.

The solution is an explicit idempotency key. The best-known real-world example is in the Stripe API documentation, a gold standard for reliable service design. Before an agent attempts a tool call, the surrounding harness generates a unique identifier for that specific operation and reuses it on every retry of that operation. The key is sent in a request header (Stripe uses Idempotency-Key). If the server sees the key for the first time, it processes the request and saves the result. If it has seen the key before, it skips processing and returns the saved result from the original request.

This pattern makes retries safe. The harness can retry a failed connection, and as long as the server honors the key, the retry will not trigger a duplicate action. It moves the responsibility for handling network ambiguity out of the LLM's reasoning loop and into a deterministic, reliable layer.
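A minimal sketch of that retry loop, under loud assumptions: FakePullRequestAPI, TransientError, and call_with_retries are hypothetical names invented for illustration, and the in-memory "server" only imitates an endpoint that honors idempotency keys. It is not Stripe's or GitHub's actual API. The point it shows is that the key is generated once per logical operation and reused on every attempt.

```python
import uuid


class TransientError(Exception):
    """A failure that may succeed on retry (for example a 503 or a timeout)."""


class FakePullRequestAPI:
    """Hypothetical in-memory stand-in for a server that honors idempotency keys."""

    def __init__(self, fail_first_n_responses=0):
        self._results = {}   # idempotency key -> saved result
        self.created = []    # every pull request that was actually created
        self._failures_left = fail_first_n_responses

    def create_pull_request(self, title, idempotency_key):
        # First time the key is seen: do the work and save the result.
        if idempotency_key not in self._results:
            pr = {"number": len(self.created) + 1, "title": title}
            self.created.append(pr)
            self._results[idempotency_key] = pr
        # Simulate a response that is lost after the work was already done.
        if self._failures_left > 0:
            self._failures_left -= 1
            raise TransientError("503 Service Unavailable")
        return self._results[idempotency_key]


def call_with_retries(operation, max_attempts=3):
    """Run operation(key) up to max_attempts times, reusing one key throughout."""
    key = str(uuid.uuid4())  # generated once per logical operation, not per attempt
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation(key)
        except TransientError as exc:
            last_error = exc  # a real harness would back off before retrying
    raise last_error


api = FakePullRequestAPI(fail_first_n_responses=2)
result = call_with_retries(lambda key: api.create_pull_request("Fix flaky test", key))
```

Here the first two responses are dropped even though the pull request was created on the first attempt. Because every attempt carries the same key, the third attempt returns the saved result and api.created still holds a single pull request. Backoff and jitter are left out to keep the sketch short.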

The Dead-Letter Queue for Persistent Failures

Retries handle transient failures. But what about persistent ones? An invalid API key, a malformed request, or a bug in the endpoint won't be fixed by retrying. This is where another data engineering workhorse comes in: the dead-letter queue (DLQ). A DLQ is a holding bay for messages or tasks that a system cannot process. As the AWS SQS documentation on dead-letter queues explains, this prevents a single bad message from blocking the entire queue.

In an agentic architecture, the execution harness should wrap every tool call with this logic. After three failed retries, it should stop. It then packages the entire context of the failed call (the tool name, inputs, idempotency key, and final error) and pushes it to a DLQ. Crucially, it returns a clean, structured error to the agent, such as: "ToolFailed: 'GitHubAPI' failed persistently. The attempt has been logged for review." This gives the agent a clear signal it can reason about. It can try an alternative tool, ask for help, or terminate gracefully. The state remains clean.
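The wrapper described above could look roughly like this. Everything here is an assumption made for illustration: ToolHarness, DeadLetter, and broken_github_tool are hypothetical names, the DLQ is a plain Python list standing in for a real queue such as SQS, and the harness retries on any exception, which a production version would likely narrow.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class DeadLetter:
    """Full context of a tool call that failed persistently."""
    tool_name: str
    inputs: dict
    idempotency_key: str
    final_error: str


@dataclass
class ToolHarness:
    max_retries: int = 3
    dead_letters: list = field(default_factory=list)  # stand-in for a real queue

    def call(self, tool_name, tool_fn, inputs):
        key = str(uuid.uuid4())  # one key shared by every attempt
        final_error = ""
        # One initial attempt plus max_retries retries.
        for _ in range(1 + self.max_retries):
            try:
                return {"ok": True, "result": tool_fn(idempotency_key=key, **inputs)}
            except Exception as exc:  # deliberately broad in this sketch
                final_error = f"{type(exc).__name__}: {exc}"
        # Persistent failure: keep the details out of the agent's context.
        self.dead_letters.append(DeadLetter(tool_name, inputs, key, final_error))
        return {
            "ok": False,
            "error": f"ToolFailed: '{tool_name}' failed persistently. "
                     "The attempt has been logged for review.",
        }


def broken_github_tool(idempotency_key, repo):
    """Hypothetical tool that always fails, e.g. because of an invalid API key."""
    raise RuntimeError("401 Bad credentials")


harness = ToolHarness()
outcome = harness.call("GitHubAPI", broken_github_tool, {"repo": "example/repo"})
```

The agent only ever sees outcome["error"], a short structured message. The raw "401 Bad credentials" detail lives in harness.dead_letters, where a human or a replay job can pick it up.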

Why Not Let the Agent Figure It Out?

There's a temptation to believe a powerful enough LLM could debug these issues on its own. The "pure agentic" approach might suggest building a meta-agent that inspects stack traces and network logs to form a recovery plan. In my experience, this adds immense cost and unreliability for a problem we already have deterministic solutions for.

Patterns like idempotency keys and DLQs are components of a broader strategy for resilient systems. They work alongside other battle-tested patterns like the Circuit Breaker, which Martin Fowler documented years ago. A circuit breaker prevents an application from repeatedly trying to execute an operation that is likely to fail. This robust harness lets the LLM do what it's good at—reasoning and planning—while letting deterministic software handle the inevitable failures of a distributed world.
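For readers who have not met the circuit breaker before, here is a deliberately small sketch of the idea. It is not Fowler's reference code or any library's API. The class name, thresholds, and the injectable clock parameter are all assumptions chosen to keep the example short and testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to attempt the call."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            raise
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
calls = []


def flaky():
    calls.append(1)
    raise ConnectionError("502 Bad Gateway")


for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

try:
    breaker.call(flaky)
    skipped = False
except CircuitOpenError:
    skipped = True
```

After two consecutive failures the breaker opens, so the third call is rejected without touching the failing endpoint. Once reset_timeout has passed, a single trial call is allowed, and a success closes the circuit again.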

[Figure: Resilient agentic architecture]

What This Buys You in Production

Adopting these "boring" patterns from data engineering gives you immediate, tangible benefits that are critical for any serious agentic system. They separate the probabilistic reasoning core from the deterministic execution shell.

Reliability: The system withstands network blips and transient API outages without corrupting state or failing the entire high-level task.
Observability: Your DLQ becomes a prioritized inbox for debugging. You can see exactly which calls are failing, why, and with what inputs, without digging through terabytes of logs.
State Integrity: The agent's precious context window is never polluted with stack traces. Its memory remains focused on the goal, not the transport layer.
Maintainability: When an external API changes, you have a clear, isolated queue to diagnose and replay the failed attempts once a fix is deployed.

The new wave of AI doesn't make old problems obsolete. It just gives them new names. The challenges of building reliable and maintainable systems are the same as they've always been. The good news is, we already have the blueprints.