Why uptime became personal the first time a site went down

Software

A personal story about a first production outage and how it shaped a philosophy of architecture for modern data and AI systems, where trust is the key metric.

The phone buzzed on the nightstand with an unnatural urgency. It was just after 2 AM on a Tuesday, the kind of silent hour when any sound feels like a siren. On the other end was a frantic client, the owner of a small e-commerce site I had built. His voice was a mix of panic and accusation. "The site is down. What is happening?"

My first reaction was denial. A quick, fumbling attempt to load the URL confirmed the outage: connection timed out. My stomach dropped. This wasn't a bug report. This was a complete, hard failure. In that moment, the abstract concept of "reliability" became intensely personal.

The Anatomy of an Obvious Failure

The system was, by today's standards, laughably simple. It was a classic monolith: one web server, one application, one database, all on a single virtual server. It was a single point of everything. I saw it as a straight line from A to B, easy to manage because all the parts were in one place.

[Figure: The Fragile Monolith. User Request → Single Web Server → Monolithic App → Single Database]

That simplicity was its downfall. The command df -h told the whole story: the root partition was 100% full. Unrotated application logs had silently consumed all available disk space, starving the database until it crashed. The fix was ugly but quick: I deleted the logs, restarted the services, and the site blinked back to life. The whole outage lasted 47 minutes. It felt like a lifetime of reactive chaos.
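The two guards that were missing that night, bounded logs and a disk-usage alarm, can be sketched in a few lines. This is only an illustration under stated assumptions: the original application's language and logging setup are not described here, so the sketch assumes a Python service, and the function names, the 10 MB cap, and the 85% threshold are hypothetical choices.

```python
import logging
import shutil
from logging.handlers import RotatingFileHandler


def build_rotating_logger(name, path, max_bytes=10_000_000, backups=5):
    """Create a logger whose files on disk are capped at (backups + 1) files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Roll over to path.1, path.2, ... once the active file nears max_bytes;
    # the oldest backup is deleted, so total log usage stays bounded.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def disk_used_fraction(mount="/"):
    """Fraction of the filesystem at `mount` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(mount)
    return usage.used / usage.total


def disk_alert(used_fraction, threshold=0.85):
    """True when usage has crossed the (assumed) alerting threshold."""
    return used_fraction >= threshold
```

A scheduled job could call disk_alert(disk_used_fraction("/")) and page someone well before the partition reaches 100%. The rotation cap matters more, though: it removes the cause instead of only reporting the symptom.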

From Hard Crashes to Hidden Debt

The real lesson wasn't the technical fix. It was that the failure, for all its drama, was at least obvious. The system was down, and there was no ambiguity. Today, the failure modes are rarely so clear-cut. The systems I build now compose deterministic data pipelines with non-deterministic LLM agents, and their problems are far more insidious.

This isn't a new observation; it's a well-documented challenge in our field. A foundational 2015 paper from Google, "Hidden Technical Debt in Machine Learning Systems" (Sculley et al.), articulated how ML systems introduce complex feedback loops and dependencies that create silent, creeping failures. An agent doesn't crash; it just starts giving subtly worse answers. A data pipeline doesn't throw a 500 error; it quietly poisons a downstream model with corrupted training data. These are not the loud bangs of a server failing; they are the slow rot of eroding trust.

Reliability is a Promise, Not a Percentage

After the site was back, the client was relieved but shaken. His business stopped because my code stopped. He trusted me, and I had failed. That's when I realized uptime isn't a number on a status page; it's the foundation of a user’s trust. The industry standard, laid out in Google's canonical Site Reliability Engineering book, codifies this with Service Level Objectives (SLOs) and error budgets. This is an essential, powerful model for managing deterministic systems.
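To make the error-budget idea concrete, here is a small worked calculation. The 99.9% availability target and the 30-day window are assumptions chosen for illustration; no SLO was ever defined for the site in this story, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed downtime in the window for an availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Budget left after the given downtime; negative means the SLO was missed."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)   # assumed 99.9% target, 30-day window
    left = budget_remaining(0.999, 47)     # the 47-minute outage from this story
    print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

Under those assumptions the budget is about 43.2 minutes per 30 days, so a single 47-minute outage would have overspent it by roughly 3.8 minutes.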

But that model is incomplete for the systems we build now. An LLM-powered agent makes a different kind of promise. It promises to "understand" or "assist." When it fails by giving a plausible but wrong answer, it doesn't spend your error budget; it bankrupts your user's trust. This is a more dangerous failure, one that a dashboard of uptime percentages will never capture.

Architecting for Trust at 3 AM

That first outage fundamentally changed how I think about architecture. It’s not about chasing an impossible perfection. It’s about building systems that are legible and diagnosable when you're half-asleep and under pressure. For today's blended data, software, and AI stacks, this means moving beyond classic SRE principles.

My focus shifted to a new set of non-negotiables:

  • Instrument the agent's brain. Standard observability shows if a service is up. For an agent, you must log the full context of its decisions: the final prompt, the tools it used, the latency of its reasoning loop, and the user's feedback. Without this, you're flying blind.
  • Isolate the blast radius of non-determinism. An LLM is a powerful but volatile component. It should be wrapped in deterministic guards. Use validation pipelines, fallback logic, and explicit checks to ensure that a bad generation doesn't corrupt a core dataset or trigger a catastrophic action.
  • Make data lineage a first-class citizen. When an agent provides a wrong answer, the first question is "why?" The answer is almost always in the data it was given. You must be able to trace a bad output back through the model version, the RAG documents, and the upstream data sources that produced it.
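The three points above can be sketched together: a deterministic guard around a non-deterministic generator, with one decision record per call that carries the lineage fields needed to trace a bad answer. Everything here is hypothetical and for illustration only: the refund scenario, the field names, the 100.0 limit, and the generate callable, which stands in for whatever LLM client a real system would use.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class DecisionRecord:
    """One log entry per agent decision; field names are illustrative."""
    prompt: str            # the final prompt sent to the model
    model_version: str     # lineage: which model produced the output
    source_doc_ids: list   # lineage: which retrieved documents were used
    tools_used: list
    latency_ms: float
    raw_output: str
    accepted: bool
    reason: str


# Safe default returned whenever the generation fails validation.
FALLBACK = {"amount": 0.0, "escalate_to_human": True}


def validate_refund(raw_output, max_amount=100.0):
    """Deterministic check: output must be a JSON object with a bounded amount."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None, "not valid JSON"
    if not isinstance(data, dict):
        return None, "not a JSON object"
    amount = data.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return None, "amount missing or not numeric"
    if not 0 <= amount <= max_amount:
        return None, "amount out of bounds"
    return {"amount": float(amount)}, "ok"


def guarded_call(generate, prompt, model_version, source_doc_ids, log, tools_used=()):
    """Call the generator, validate its output, log the decision, fall back if needed."""
    start = time.perf_counter()
    try:
        raw = generate(prompt)
    except Exception as exc:  # a failing generator must not take the caller down
        raw = f"<generator error: {exc}>"
    latency_ms = (time.perf_counter() - start) * 1000
    parsed, reason = validate_refund(raw)
    record = DecisionRecord(
        prompt=prompt,
        model_version=model_version,
        source_doc_ids=list(source_doc_ids),
        tools_used=list(tools_used),
        latency_ms=latency_ms,
        raw_output=raw,
        accepted=parsed is not None,
        reason=reason,
    )
    log(json.dumps(asdict(record)))  # one structured line per decision
    return parsed if parsed is not None else dict(FALLBACK)
```

A plausible but malformed generation such as "Sure, I'd refund $5000" never reaches the order system in this sketch. The caller receives the fallback, and the log line records which prompt, model version, and documents produced the rejected output.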

The goal is to build systems where failures are not just survivable, but understandable. The architecture itself must guide you from a subtle symptom to its root cause.

[Figure: Modern Composed System Architecture]
  • Ingestion & Sources: User Input, Event Streams, Vector DBs, APIs
  • Coordinated Processing: Deterministic Pipelines, LLM Agents, Data Validation, State Store
  • Observability & Control: Decision Logging, Cost Monitoring, Data Lineage, Model Registry
  • Serving & Outputs: Structured APIs, Agent Responses, Dashboards

That visceral feeling of letting a user down has never left me. It’s the ghost in the machine that pushes for one more validation step, a clearer decision log, or a fallback that ensures the system fails with honesty instead of confidence. We aren't just shipping code; we are making promises, and our architecture is the measure of how well we intend to keep them.

Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.