Why uptime became personal the first time a site went down
A personal story about a first production outage and how it shaped a philosophy of architecture for modern data and AI systems, where trust is the key metric.
The phone buzzed on the nightstand with an unnatural urgency. It was just after 2 AM on a Tuesday, the kind of silent hour when any sound feels like a siren. On the other end was a frantic client, the owner of a small e-commerce site I had built. His voice was a mix of panic and accusation. "The site is down. What is happening?"
My first reaction was denial. A quick, fumbling attempt to load the URL confirmed it. Connection timed out. My stomach dropped. This wasn't a bug report. This was a complete, hard failure. In that moment, the abstract concept of "reliability" became intensely personal.

The Anatomy of an Obvious Failure
The system was, by today's standards, laughably simple. It was a classic monolith: one web server, one application, one database, all on a single virtual server. It was a single point of everything. I saw it as a straight line from A to B, easy to manage because all the parts were in one place.
That simplicity was its downfall. The command df -h told the whole story: the root partition was 100% full. Unrotated application logs had silently consumed all available disk space, starving the database until it crashed. The fix was ugly but quick: I deleted the logs, restarted the services, and the site blinked back to life. The whole outage lasted 47 minutes. It felt like a lifetime of reactive chaos.
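The post does not say what stack the site ran on, so the following is only a minimal Python sketch of the two safeguards that would have prevented this kind of outage: size-capped log rotation and a disk-usage check. The logger name, file path, size limits, and 90% threshold are all illustrative assumptions, not details from the original system.

```python
import logging
import shutil
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path: str, name: str = "app",
                          max_bytes: int = 10_000_000,
                          backups: int = 5) -> logging.Logger:
    """Return a logger whose files on disk stay bounded.

    The handler keeps the active file plus `backups` rotated copies, so
    total usage is roughly max_bytes * (backups + 1). All defaults here
    are assumed values for illustration.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def disk_usage_fraction(mount: str = "/") -> float:
    """Fraction of the filesystem at `mount` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(mount)
    return usage.used / usage.total


def disk_is_nearly_full(mount: str = "/", threshold: float = 0.9) -> bool:
    """True when usage crosses the (assumed) alert threshold."""
    return disk_usage_fraction(mount) >= threshold
```

A check like `disk_is_nearly_full()` run on a schedule and wired to an alert turns a 2 AM phone call into a daytime warning; rotation alone bounds the logs, but the check also catches whatever else fills the disk.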

From Hard Crashes to Hidden Debt
The real lesson wasn't the technical fix. It was that the failure, for all its drama, was at least obvious. The system was down, and there was no ambiguity. Today, the failure modes are rarely so clear-cut. The systems I build now compose deterministic data pipelines with non-deterministic LLM agents, and their problems are far more insidious.
This isn't a new observation; it's a well-documented challenge in our field. A foundational 2015 paper from Google, "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., NeurIPS 2015), articulated how ML systems introduce complex feedback loops and dependencies that create silent, creeping failures. An agent doesn't crash; it just starts giving subtly worse answers. A data pipeline doesn't throw a 500 error; it quietly poisons a downstream model with corrupted training data. These are not the loud bangs of a server failing; they are the slow rot of eroding trust.

Reliability is a Promise, Not a Percentage
After the site was back, the client was relieved but shaken. His business stopped because my code stopped. He trusted me, and I had failed. That's when I realized uptime isn't a number on a status page; it's the foundation of a user’s trust. The industry standard, laid out in Google's canonical Site Reliability Engineering book, codifies this with Service Level Objectives (SLOs) and error budgets: an SLO sets a reliability target, and the error budget is the amount of unreliability that target still allows. This is an essential, powerful model for managing deterministic systems.
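To make the error-budget arithmetic concrete, here is a small sketch. The 30-day window and the availability targets are assumptions chosen for illustration; the original site had no stated SLO.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows within the window.

    `slo` is a fraction such as 0.999 for "three nines".
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after the given downtime; negative means overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

Over a 30-day window, a 99.9% target allows about 43.2 minutes of downtime, so a single 47-minute outage would overspend that budget by roughly 3.8 minutes. A 99% target allows 432 minutes and would absorb it easily.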
But that model is incomplete for the systems we build now. An LLM-powered agent makes a different kind of promise. It promises to "understand" or "assist." When it fails by giving a plausible but wrong answer, it doesn't spend your error budget; it bankrupts your user's trust. This is a more dangerous failure, one that a dashboard of uptime percentages will never capture.

Architecting for Trust at 3 AM
That first outage fundamentally changed how I think about architecture. It’s not about chasing an impossible perfection. It’s about building systems that are legible and diagnosable when you're half-asleep and under pressure. For today's blended data, software, and AI stacks, this means moving beyond classic SRE principles.
My focus shifted to a new set of non-negotiables:
- Instrument the agent's brain. Standard observability shows if a service is up. For an agent, you must log the full context of its decisions: the final prompt, the tools it used, the latency of its reasoning loop, and the user's feedback. Without this, you're flying blind.
- Isolate the blast radius of non-determinism. An LLM is a powerful but volatile component. It should be wrapped in deterministic guards. Use validation pipelines, fallback logic, and explicit checks to ensure that a bad generation doesn't corrupt a core dataset or trigger a catastrophic action.
- Make data lineage a first-class citizen. When an agent provides a wrong answer, the first question is "why?" The answer is almost always in the data it was given. You must be able to trace a bad output back through the model version, the RAG documents, and the upstream data sources that produced it.
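The three principles above can be sketched together in one small wrapper. This is an illustration under stated assumptions, not a description of any particular framework: `generate` stands in for whatever model client is in use, and the action set, the fallback action, and the record fields are all hypothetical.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Callable, List, Optional

ALLOWED_ACTIONS = {"refund", "escalate", "no_action"}  # hypothetical action set
FALLBACK = {"action": "escalate"}  # assumed safe default: hand off to a human


@dataclass
class DecisionRecord:
    """One log entry per agent decision, including lineage fields."""
    prompt: str
    model_version: str
    source_doc_ids: List[str]  # e.g. IDs of the retrieved RAG documents
    raw_output: str = ""
    latency_s: float = 0.0
    accepted: bool = False
    reason: str = ""


def validate(raw: str) -> Optional[dict]:
    """Deterministic guard: accept only a JSON object with a known action."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    return data


def guarded_call(generate: Callable[[str], str], prompt: str,
                 model_version: str, source_doc_ids: List[str],
                 log: Callable[[dict], None]) -> dict:
    """Call the model, validate its output, fall back if needed, log everything."""
    record = DecisionRecord(prompt, model_version, list(source_doc_ids))
    start = time.perf_counter()
    try:
        record.raw_output = generate(prompt)
    except Exception as exc:  # the model call itself failed
        record.reason = f"generation error: {exc}"
    record.latency_s = time.perf_counter() - start

    result = None if record.reason else validate(record.raw_output)
    if result is None:
        record.reason = record.reason or "failed validation"
        result = dict(FALLBACK)  # fail honestly rather than act on bad output
    else:
        record.accepted = True
    log(asdict(record))  # the record is written whether or not the output was accepted
    return result
```

Because every call emits a record carrying the prompt, model version, and source document IDs, a wrong answer can be traced back to its inputs. Tool calls and user feedback could be added as further fields on the same record.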
The goal is to build systems where failures are not just survivable, but understandable. The architecture itself must guide you from a subtle symptom to its root cause.
That visceral feeling of letting a user down has never left me. It’s the ghost in the machine that pushes for one more validation step, a clearer decision log, or a fallback that ensures the system fails with honesty instead of confidence. We aren't just shipping code; we are making promises, and our architecture is the measure of how well we intend to keep them.