My first on-call rotation and the humility it brought

Web

My first on-call rotation, a 3 AM outage, and the cascade failure that taught me humility is a core technical skill in systems architecture.

The pager didn't just vibrate; it shrieked. It was a physical jolt from a shallow, anxious sleep, the kind you only get when you know a machine might need you at any moment. This was my first week on the hook for a large-scale platform, and I was still confident. I knew the diagrams. I had written some of the code. I felt I understood the system.

That confidence lasted until about 3:15 AM on a Tuesday.

The Confidence of a Clean Whiteboard

Early in my career, I saw systems as elegant, deterministic constructs. You could draw them on a whiteboard, reason about them, and predict their behavior. Our core recommendation engine was a perfect example. On the board, it was a clean, sequential flow: ingest user activity logs, run a nightly batch process, and write the results to a key-value store for the front-end to read. Simple.

I could explain every box and arrow. I had confused the map with the territory, assuming the clean lines of the diagram represented the messy reality of production. The operational side felt like janitorial work—important, but not the heart of the craft. I hadn't yet learned that a system's heart isn't its resting state, but its pulse under extreme stress.

Ingest User Logs → Nightly Batch Job → Write to KV Store → Serve from API
The Whiteboard Architecture

The 3 AM Cascade

The first alert was for high latency on the primary user API. Strange. Then another: database connection pools exhausted. Then a third: CPU utilization across the entire web fleet was pegged at 100%. This wasn't a blip; it was a cascade, and nothing pointed to the simple batch job I thought I knew.

The root cause was buried two layers deep. It wasn't a logic bug, but a data-induced one: a single four-byte emoji in a user comment, which our unpatched parsing library misread as a signal to allocate a two-gigabyte buffer. The process didn't crash; it just swelled, holding its database connections open. The orchestrator, seeing the job as "running," let it continue, slowly starving every other service of resources. The system didn't break; it choked on a single, invisible character.

Architecture Is How It Fails

Finding the problem took hours. The fix was a one-line dependency update and a manual kill -9. The feeling wasn't triumph; it was a cold, profound humility. My clean whiteboard diagram had no box for "subtle data corruption causes resource leak that starves an unrelated service."

It took me that crisis to learn what Dr. Richard Cook articulated so well in his seminal paper, How Complex Systems Fail. Complex systems are inherently and unavoidably hazardous; they contain failure as a feature. My diagram described the system’s successful state, but its true architecture was defined by its hidden capacity for failure. The real architecture is not the set of components; it's the web of shared dependencies—CPU, memory, database connections—and how they behave when one piece misfires.

From Brittle Theory to Resilient Practice

That incident permanently changed how I build things. Observability stopped being an afterthought and became the first thing I'd build: logs, metrics, and traces are the tools for asking questions of a system on fire. Timeouts, retries, and circuit breakers went from being edge-case logic to the absolute core of any service that talks to another.
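To make the circuit breaker idea concrete, here is a minimal sketch in Python. It is an illustration, not the code from the incident: the `CircuitBreaker` class, its `max_failures` and `reset_after` parameters, and the injectable `clock` are all names assumed for this example. After a run of consecutive failures it stops calling the dependency and fails fast, then allows a single trial call once a cool-down has passed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical, single-threaded breaker for illustration only."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: do not touch the struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping an outbound call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` stands in for any remote call) means a failing dependency costs one fast exception instead of a held connection. A production version would also need locking and per-dependency state, which this sketch leaves out.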

Today, this lesson is the foundation for how I architect systems that blend deterministic code with agentic, probabilistic models. An LLM agent generating malformed JSON isn't a hypothetical edge case; it's a statistical certainty. A defensive parser, a dead-letter queue for bad outputs, and strict resource bounds on the agent's process aren't optional features. They are the deterministic wrapper that makes an agentic component survivable in production. You have to build for the certainty of failure.
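A defensive parser with a dead-letter queue can be quite small. The sketch below is one way to do it, under stated assumptions: the `action`/`arguments` schema, the 64 KiB size cap, and the in-memory list standing in for a real dead-letter queue are all invented for illustration. Whatever the agent emits, the function returns either a validated dict or `None`, and nothing malformed travels further downstream.

```python
import json

MAX_OUTPUT_BYTES = 64 * 1024  # hypothetical cap on a single agent response
REQUIRED_FIELDS = {"action": str, "arguments": dict}  # hypothetical schema


def parse_agent_output(raw, dead_letters):
    """Return a validated dict, or None after recording the bad output.

    `raw` is the agent's text output; `dead_letters` is any list-like sink
    standing in for a real dead-letter queue.
    """

    def reject(reason):
        # Keep a truncated copy so one huge output cannot flood the queue.
        dead_letters.append({"reason": reason, "raw": raw[:1000]})
        return None

    # Bound the input before doing any parsing work on it.
    if len(raw.encode("utf-8", errors="replace")) > MAX_OUTPUT_BYTES:
        return reject("output too large")
    try:
        data = json.loads(raw)
    except (ValueError, RecursionError) as exc:
        # json.JSONDecodeError is a ValueError; deep nesting can also
        # exhaust the recursion limit.
        return reject(f"invalid JSON: {exc}")
    if not isinstance(data, dict):
        return reject("top-level value is not an object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return reject(f"field {field!r} missing or wrong type")
    return data
```

The size check comes first on purpose: it is the cheapest test, and it bounds the cost of every step after it.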

Ingestion & Decoupling: Event Stream, Message Queue, Schema Registry
Validation & Sanitization: Data Sanitizer, Schema Validator, Dead-Letter Queue
Bounded Processing: Resource-Limited Runner, Core Logic, Timeout Monitor, Circuit Breaker
Serving & Observability: Processed Datastore, Observability Platform, User-Facing APIs
A Resilient Data Processing Architecture
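The bounded-processing idea is the direct answer to a job that stays "running" forever. As a rough sketch, and only that, a wall-clock deadline can be enforced with Python's standard `subprocess` module. The `run_bounded` helper and its status strings are names made up for this example, and it covers time only; memory and CPU limits would need OS-level controls such as cgroups or container limits.

```python
import subprocess
import sys


def run_bounded(argv, timeout_seconds):
    """Run a command under a wall-clock deadline.

    Returns (status, stdout), where status is 'ok', 'failed', or 'timeout'.
    """
    try:
        completed = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so a stuck job cannot keep holding connections or memory.
        return "timeout", ""
    status = "ok" if completed.returncode == 0 else "failed"
    return status, completed.stdout


if __name__ == "__main__":
    # A job that would sleep for 30 seconds is cut off after 1 second.
    print(run_bounded([sys.executable, "-c", "import time; time.sleep(30)"], 1))
```

The design point is that "still running" stops being an acceptable terminal state: every job ends as a success, a failure, or a timeout that someone can alert on.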

What Stays With You

That baptism by pager informs my work every single day, especially now that software, data, and AI are converging. The probabilistic nature of modern systems only makes these old lessons more relevant.

  • Your system is defined by its seams. The most catastrophic failures happen at the boundaries between services, in the data contracts and shared resource pools. Defend those seams with validation, timeouts, and backpressure.
  • Design for debuggability first. The most brilliant algorithm is a liability if you can't tell what it's doing in production. A simpler, more observable system will always beat a complex, opaque one when things go wrong.
  • Humility is a technical skill. The most dangerous state for an architect is believing you fully understand a complex system. Assume you are wrong. Assume it will fail in ways you cannot predict. Build accordingly.
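
Backpressure, from the first point above, is the least familiar of the three defenses, so here is a small sketch of it. It assumes a hypothetical `BoundedInbox` sitting at the seam between a producer and a slower consumer, built on Python's standard `queue.Queue`. When the inbox is full it refuses new work immediately rather than letting the backlog grow without limit.

```python
import queue


class BoundedInbox:
    """Hypothetical bounded buffer that rejects work when full."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0  # a count worth exporting as a metric

    def offer(self, item):
        """Try to enqueue without blocking; return False when full."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            # Tell the producer to slow down or retry later.
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest item, or None when empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None
```

A producer that gets `False` back has to decide what to do: drop the item, retry with a delay, or pass the refusal upstream. Any of those is better than an unbounded buffer that hides the problem until memory runs out.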
Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.