Designing for failure before failure found me

For years, the pager going off at 3 AM taught me about failure. A dropped network packet, a database replica lag, a full disk—these were the classic ghosts in the machine. We learned to exorcise them with patterns that are now foundational. But recently, the ghosts have changed. They’re subtler, more expensive, and far less predictable. They are the failure modes of large language models integrated into deterministic systems.

The mistake I see teams making is treating an LLM API call like any other REST service. It’s not. It has non-deterministic latency, its failure modes include semantic gibberish, and every retry costs real money. Applying the old resilience patterns isn't enough; we have to adapt them for this new, agentic world. The core principles, born from acknowledging the classic Fallacies of Distributed Computing, are more relevant than ever, but their application needs a serious update.

[Figure: Hybrid System Request Flow]

Retries Are Now a Budgeting Problem

The first tool we reach for is the retry. But a naive retry against a flaky LLM endpoint is a fantastic way to burn through your budget before your morning coffee. If a call costs two cents and your service retries each of a hundred concurrent requests five times, every request makes up to six attempts: $2 of work becomes as much as $12, a 6x cost amplification for zero gain.

This is where the classic pattern of exponential backoff with jitter is still the right start. It's a technique proven to prevent self-inflicted denial-of-service attacks, as detailed in an excellent post on the AWS Architecture Blog: the growing delay gives the downstream service time to recover, and the jitter keeps clients from retrying in lockstep. But for agentic systems, we have to add another layer: idempotency and cost control.
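To make the budgeting angle concrete, here is a minimal sketch of backoff with "full jitter" plus a hard spending cap. The names (call_with_retries, RetryBudgetExceeded, cost_cents, max_spend_cents) are hypothetical, costs are tracked in whole cents to avoid float drift, and the sketch assumes every attempt is billed and that only TimeoutError counts as transient. Adjust both assumptions to your provider.

```python
import random
import time


class RetryBudgetExceeded(Exception):
    """Raised when one more attempt would exceed the spending cap."""


def backoff_delays(max_retries, base=0.5, cap=8.0, rng=random.random):
    """Yield one full-jitter delay per retry: uniform in [0, min(cap, base * 2**n)]."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retries(call, cost_cents, max_spend_cents, max_retries=3,
                      sleep=time.sleep, rng=random.random):
    """Run `call` until it succeeds, retries run out, or the budget is spent.

    Returns (result, spent_cents). Every attempt is charged, including failures
    (an assumption; some providers do not bill failed calls).
    """
    spent = 0
    delays = backoff_delays(max_retries, rng=rng)
    while True:
        # Check the budget before each attempt, not after.
        if spent + cost_cents > max_spend_cents:
            raise RetryBudgetExceeded(
                f"spent {spent} cents, cap is {max_spend_cents} cents")
        spent += cost_cents
        try:
            return call(), spent
        except TimeoutError:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the original error
            sleep(delay)
```

The budget check sits in front of every attempt, so the cap holds even if max_retries is set generously. In a real service the cap would more likely be shared across requests than tracked per call.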

Before you even consider a retry, you must know if the operation is idempotent. Retrying a simple data lookup is safe. Retrying an agentic call that posts a message to a third-party API is a disaster. You have to design your agent's tool-use functions to be safely repeatable or not repeatable at all. This has become a first-order design concern, not a clean-up task.
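One common way to make a side-effecting tool safely repeatable is an idempotency key. The sketch below is illustrative only: IdempotentTool and its methods are hypothetical names, and the in-memory dict stands in for a durable store. It also does not cover a crash between performing the action and recording its result, which a production design has to address.

```python
import hashlib
import json


class IdempotentTool:
    """Wrap a side-effecting tool so a repeated request runs it at most once."""

    def __init__(self, action):
        self._action = action
        self._results = {}  # idempotency key -> stored result (in-memory stand-in)

    @staticmethod
    def key_for(request_id, args):
        # Same request id and same arguments always produce the same key.
        payload = json.dumps({"request_id": request_id, "args": args},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, request_id, **args):
        key = self.key_for(request_id, args)
        if key not in self._results:
            # First time we see this key: perform the action and remember it.
            self._results[key] = self._action(**args)
        # A retry with the same key returns the stored result instead.
        return self._results[key]
```

With this in place, a retry loop that reuses the same request_id cannot post the same message twice, while a new request_id is treated as a new action.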

The Circuit Breaker for Semantic and Financial Safety

Retries handle transient blips. For systemic outages, we need to stop making calls entirely. This is the job of the circuit breaker, a pattern that Michael T. Nygard's book Release It! defines masterfully. The breaker monitors for failures, and if they exceed a threshold, it trips "open," failing subsequent calls fast without even making a network request. After a cool-down period it lets a trial call through and closes again if that call succeeds. This protects the calling service from wasting resources on a dependency that's down.

With LLM agents, the breaker's logic gets more interesting. It shouldn't just trip on network errors or 500s. It should also trip on a spike in P99 latency, or on a series of responses that fail semantic validation—responses that are syntactically valid JSON but contain nonsense. It can even be wired into a cost observer, tripping if a particular agent's daily budget is exceeded. The circuit breaker becomes a tool for financial and logical safety, not just network resilience.

This pattern provides an escape hatch, allowing the system to preserve its core functionality when its agentic "brain" is unavailable or malfunctioning.
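A breaker along these lines might look like the following sketch. All names (LLMCircuitBreaker, CircuitOpen, validate, daily_budget_cents) are hypothetical. It treats exceptions and failed semantic validation as the same kind of failure, trips on a spend cap, and leaves out latency tracking and thread safety for brevity.

```python
import time


class CircuitOpen(Exception):
    """Raised instead of calling the dependency while the breaker is open."""


class LLMCircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0,
                 daily_budget_cents=500, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.daily_budget_cents = daily_budget_cents
        self.clock = clock
        self.failures = 0        # consecutive failures
        self.spent_cents = 0     # caller is expected to reset this daily
        self.opened_at = None    # None means the breaker is closed

    def allow(self):
        if self.spent_cents >= self.daily_budget_cents:
            return False  # budget trip: stays open until spend is reset
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok, cost_cents=0):
        self.spent_cents += cost_cents
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()

    def call(self, fn, validate, cost_cents=0):
        if not self.allow():
            raise CircuitOpen("LLM dependency is disabled")
        try:
            response = fn()
        except Exception:
            self.record(ok=False, cost_cents=cost_cents)
            raise
        # A well-formed but nonsensical response counts as a failure too.
        ok = validate(response)
        self.record(ok=ok, cost_cents=cost_cents)
        if not ok:
            raise ValueError("response failed semantic validation")
        return response
```

The validate callback is where the application-specific knowledge lives: a schema check, a range check on extracted numbers, or whatever "nonsense" means for your agent.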

Graceful Degradation in Hybrid Systems

When a circuit breaker trips, what happens next defines the quality of your architecture. For a system blending deterministic and agentic work, this is where you can truly shine. Graceful degradation means separating the essential from the enhancement.

I worked on a system that generated a report. Ninety percent of it was structured data pulled and aggregated from databases—a deterministic pipeline. The last ten percent was a natural language summary generated by an LLM. The LLM was a powerful feature, but not essential. Our architecture treated it that way.

The call to the summarization agent was wrapped in a circuit breaker. If the breaker was open, the orchestrator simply skipped that step and rendered the report with only the structured data. The user got their critical numbers and charts, just not the convenient summary. It was a degraded experience, but not a failed one. The business function was preserved.
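In code, the orchestration step could be as small as the sketch below. This is a reconstruction under assumptions, not the original system: build_report, load_metrics, summarize, and breaker_allows are hypothetical names, and catching an exception from the summarizer is an addition beyond the "skip when the breaker is open" behavior described above.

```python
def build_report(load_metrics, summarize, breaker_allows):
    """Assemble a report; the LLM summary is optional, the metrics are not."""
    # Deterministic path: if this fails, the report genuinely cannot be built,
    # so errors are allowed to propagate.
    metrics = load_metrics()
    report = {"metrics": metrics, "summary": None, "degraded": False}

    if not breaker_allows():
        # Breaker is open: skip the agent entirely and ship the numbers.
        report["degraded"] = True
        return report

    try:
        report["summary"] = summarize(metrics)
    except Exception:
        # Enhancement failed mid-call: still ship the numbers.
        report["degraded"] = True
    return report
```

The degraded flag lets the rendering layer show a short notice instead of silently dropping the summary, and gives monitoring something to count.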

Application vs. Platform Resilience

A fair question is where to implement these patterns. Should every developer be writing their own retry logic? For most teams, the answer is no. You can bake these patterns into shared client libraries within your application code, which gives you fine-grained control.

Alternatively, a modern service mesh like Istio or Linkerd can provide much of this functionality at the platform level, transparently. This is a powerful approach, but it's a trade-off. You gain operational simplicity at the cost of application-specific context. A service mesh knows about HTTP 503 errors, but it doesn't know that your LLM response failed a semantic check or that a specific call exceeded its cost-per-query limit. For hybrid systems, I've found a combination works best: use the platform for network-level resilience and application-level breakers for the specific, contextual failure modes of your AI components.

[Figure: Resilient Hybrid AI Architecture]

What Holds Up in Production

Designing for failure in 2026 is about more than just network reliability. It’s about building systems that are resilient to the novel failure modes of agentic components. The core principles remain the same, but the implementation requires a new level of awareness.

Treat Agent Calls as Fragile: An LLM is your most expensive, slowest, and least predictable dependency. Architect your system to withstand its failure.
Degrade to Deterministic: Build your system so that if the agentic parts fail, it can fall back to its deterministic core. The boring, structured data path is your safety net.
Monitor Cost and Semantics, Not Just Uptime: Your circuit breakers need to be smarter. They should trip on budget overruns and logical failures, not just timeouts.
Idempotency is Non-Negotiable: Before you build a retry loop around an agent that can perform actions, you must have a rock-solid answer for what happens when it runs twice.

Anticipating these failures isn't pessimism. It's the craftsmanship required to build things that last, moving our work from a cool demo to a durable, production-ready system that still works at 3 AM, no matter what ghosts are in the machine.