
My first production bug at 3am, and what it taught me


A 3am production bug in a deterministic batch job taught me foundational lessons in defensive design and observability that are now critical for building agentic AI.

The 3am pager is a special kind of dread. Mine came on a Tuesday, for a system that was supposed to be boringly predictable: a nightly batch job processing international orders. It was a deterministic pipeline, the kind of software we're told is safe and reliable. But for five hours, it had been completely silent. No errors, no alerts, just a void where our European fulfillment data should have been. Staring at that blinking cursor, I felt the cold truth that even the simplest deterministic systems can fail in profoundly mysterious ways.

The Anatomy of a Silent Failure

Panic is the first instinct. The second is to start pulling threads. I checked the logs, but there was nothing. The process hadn't logged a fatal error; it had simply ceased to exist mid-run. System resources were idle. The database was responsive. This wasn't a loud crash, but a quiet, baffling disappearance. This is the kind of failure that erodes your confidence in the tools, the code, and yourself.

Our monitoring back then was primitive. It could tell us the heart had stopped beating, but not why. I was flying blind, guessing at causes. The lack of detailed, structured logging (the exact problem that practitioners like Charity Majors would later help the industry solve through the discipline of observability) turned a ten-second diagnosis into a two-hour ghost hunt. Every minute spent guessing meant another cohort of angry customers and another step deeper into a costly outage.

[Figure: The Assumed Happy Path. Ingest Batch → Enrich Data → Process Payments → Queue Fulfillment]

An Epoch-Zero Bug in a Deterministic World

The break came from manually re-running the job, feeding it one record at a time. For nearly two hundred orders, it worked. Then, one record from a newly onboarded country caused the script to exit without a trace. The record had a single `null` value in a `user_region_profile` field. The upstream enrichment service hadn't populated it yet.

My code didn't crash on the null. Instead, a timestamp formatting library, when given a null region, silently defaulted to an epoch-zero date: `1970-01-01T00:00:00Z`. This poison data was then passed to a downstream message queue. The consumer service, built defensively, saw this impossible timestamp from the past, correctly flagged it as corrupt data, and locked the record for manual review. But in doing so, it stopped processing the entire queue behind it. A single null value hadn't caused a crash; it had created a silent, cascading stalemate across the whole system.
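The silent default is easy to reproduce. The sketch below is not the original code or the real library: `format_lenient`, `format_strict`, and the `KNOWN_REGIONS` set are hypothetical stand-ins, assuming the formatter takes an epoch timestamp plus a region and emits a UTC string.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical set of regions the formatter knows about (illustration only).
KNOWN_REGIONS = {"eu-west", "us-east"}


def _to_utc_string(epoch_seconds: int) -> str:
    """Render an epoch timestamp as an ISO-8601 UTC string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def format_lenient(epoch_seconds: int, region: Optional[str]) -> str:
    """Mimics the silent behavior: a null or unknown region falls back to epoch zero."""
    if region not in KNOWN_REGIONS:
        epoch_seconds = 0  # silent default: no exception, no log line
    return _to_utc_string(epoch_seconds)


def format_strict(epoch_seconds: int, region: Optional[str]) -> str:
    """Fails loudly instead of inventing a timestamp."""
    if region is None:
        raise ValueError("user_region_profile is null; refusing to format timestamp")
    if region not in KNOWN_REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    return _to_utc_string(epoch_seconds)


print(format_lenient(1700000000, None))  # 1970-01-01T00:00:00Z
```

The lenient version returns a perfectly well-formed string, which is why nothing upstream noticed. The strict version turns the same input into an exception at the point where the bad value first appears.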

Patterns That Survive the 3am Pager

The fix was a two-line null check. The lessons, however, have shaped every system I've built since. That night taught me to value boring, durable patterns over trendy, fragile ones. It sent me searching for a vocabulary to describe these production realities, which I later found in Michael T. Nygard's book *Release It!* His stability patterns weren't abstract theory; they were distilled from 3am fires just like mine.

Three principles became non-negotiable for me. First, **every input is hostile.** Trusting an upstream service was my original sin. Defensive validation isn't boilerplate; it's the foundation of production-ready systems. Second, **build for debuggability.** Observability isn't a feature you add later. It is the system. And third, **know your dependencies' failure modes.** I had used a library without understanding its behavior on bad input. Its silent failure was my silent failure.
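As a rough illustration of the first principle, here is a minimal boundary check for an incoming order record. The field names other than `user_region_profile`, the `validate_order` helper, and the year-2000 plausibility floor are assumptions made for this sketch, not details of the original system.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Assumed business floor: no legitimate order predates this (illustrative value).
EARLIEST_PLAUSIBLE = datetime(2000, 1, 1, tzinfo=timezone.utc)

REQUIRED_FIELDS = ("order_id", "user_region_profile", "created_at")


def validate_order(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors: List[str] = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            errors.append(f"{field} is missing or null")

    created_at = record.get("created_at")
    if created_at is None:
        return errors
    if not isinstance(created_at, str):
        errors.append("created_at is not a string")
        return errors
    try:
        # fromisoformat on older Pythons does not accept a trailing "Z".
        parsed = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    except ValueError:
        errors.append("created_at is not an ISO-8601 timestamp")
        return errors
    if parsed.tzinfo is None:
        errors.append("created_at has no timezone")
    elif parsed < EARLIEST_PLAUSIBLE:
        errors.append("created_at is implausibly old")
    return errors
```

Returning every problem at once, instead of raising on the first, gives whoever reads the rejection a full picture of what was wrong with the record.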

From Deterministic Bugs to Agentic Chaos

This story of a simple data bug might seem quaint. But that lesson is far more critical now that we compose deterministic pipelines with non-deterministic LLM agents. The "input" for an agent isn't a clean record, but a messy swirl of natural language, context windows, and tool outputs. An agent's failure mode isn't a predictable epoch-zero bug, but a confident hallucination or a subtly incorrect tool call that generates plausible, yet wrong, data.

The happy-path demos for new agentic frameworks rarely show you their equivalent of an epoch-zero bug. They don't highlight how an agent can fail silently, producing outputs that are just plausible enough to poison downstream systems in ways you won't discover for weeks. As practitioners like Chip Huyen have documented, the operational rigor required to run LLMs in production is immense precisely because their failure modes are so vast and non-deterministic. The frantic search for that null value taught me a healthy skepticism of any system that can't be easily observed. If a simple, deterministic job could hide its failure so well, we must be relentlessly vigilant about the far more complex and opaque systems we build today.
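One modest guard is to treat an agent's proposed tool call like any other untrusted input and check its structure before anything downstream acts on it. The sketch below assumes the agent emits a JSON object with `tool` and `args` keys; that shape, the `ALLOWED_TOOLS` table, and the tool names are hypothetical and not tied to any particular framework.

```python
import json
from typing import Tuple

# Hypothetical allow-list: tool name -> exact set of argument names it accepts.
ALLOWED_TOOLS = {
    "lookup_order": {"order_id"},
    "refund_order": {"order_id", "amount_cents"},
}


def check_tool_call(raw: str) -> Tuple[bool, str]:
    """Return (ok, reason) for a raw tool-call string produced by an agent."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        return False, f"unknown tool: {tool!r}"

    args = call.get("args")
    if not isinstance(args, dict):
        return False, "args must be a JSON object"

    required = ALLOWED_TOOLS[tool]
    missing = sorted(required - set(args))
    extra = sorted(set(args) - required)
    if missing:
        return False, f"missing args: {missing}"
    if extra:
        return False, f"unexpected args: {extra}"
    return True, "ok"
```

A check like this only catches structural mistakes. A well-formed call with a plausible but wrong `amount_cents` still passes, so semantic checks and traces of what the agent was asked remain necessary.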

[Figure: A Resilient Data and AI Architecture. Sources (Event Streams, Databases, User Prompts); Ingestion and Validation (Schema Validation, Dead-Letter Queue, Input Sanitization); Core Processing (Deterministic Pipelines, LLM Agent Orchestrator, Shared State Store, Observability Bus); Serving Layer (APIs, Analytics Dashboards, Alerting)]

Concrete Takeaways

  • Assume Hostile Inputs. Whether from a database or an LLM, treat all incoming data as potentially malformed. Validate schemas, sanitize inputs, and build explicit logic for handling unexpected values before they enter your core system.
  • Isolate Failures. A single bad record, or a single confused agent response, must never be allowed to halt an entire workflow. Use dead-letter queues and other fault-tolerance patterns to contain the blast radius of inevitable errors.
  • Log for Your Future Self. Your logs and traces must tell a clear story to a sleep-deprived engineer at 3am. Log the intent, the inputs, and the outcome of every critical step. Context is not a luxury; it is a requirement.
  • Master Your Dependencies. Before integrating an API, a library, or a pre-trained model, investigate its failure modes. What happens on a timeout? With malformed input? With ambiguous prompts? This knowledge is more important than the happy-path documentation.
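
The isolation and logging takeaways can be combined in one small loop. This is a minimal in-memory sketch, assuming a per-record `handler` callable; `process_with_dlq`, the list standing in for a dead-letter queue, and the log field names are all hypothetical, and a real system would use its broker's dead-letter mechanism instead.

```python
import json
import logging
from typing import Any, Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("batch")


def process_with_dlq(
    records: Iterable[Dict[str, Any]],
    handler: Callable[[Dict[str, Any]], Any],
) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Process every record; park failures in a dead-letter list instead of stopping."""
    processed: List[Any] = []
    dead_letters: List[Dict[str, Any]] = []
    for index, record in enumerate(records):
        error = None
        try:
            processed.append(handler(record))
            outcome = "ok"
        except Exception as exc:  # contain the failure to this one record
            error = repr(exc)
            outcome = "dead_lettered"
            dead_letters.append({"record": record, "error": error})
        # One structured line per record: the step, the input, and the outcome.
        logger.info(json.dumps(
            {"step": "process_record", "index": index, "input": record,
             "outcome": outcome, "error": error},
            default=str,
        ))
    return processed, dead_letters


def require_region(record: Dict[str, Any]) -> str:
    """Example handler: rejects records with a null region."""
    if record.get("user_region_profile") is None:
        raise ValueError("user_region_profile is null")
    return record["order_id"]
```

With this shape, a single null region produces one dead-lettered record and one log line explaining why, while the records queued behind it continue to flow.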
Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.