The systems behind the systems: observability for a one-person platform

The 3 AM alert is a rite of passage for anyone who runs production systems. It is also, in my experience, almost always a design failure. Not a failure of the system being monitored, but of the monitoring itself. I’ve spent years running platforms where I was the first, and often only, person on call. That experience taught me that most alerting setups are actively hostile to the person they’re supposed to help.

They scream about transient CPU spikes or a service restarting on another node. For a solo operator, alert fatigue isn’t just an inconvenience; it's a critical vulnerability. The only way to survive is to build an observability system that has a deep, nuanced understanding of what "broken" actually means.

The Tyranny of the Single Blip

Early in my career, I built monitoring by focusing on individual component health. A method like Brendan Gregg’s USE Method, which checks Utilization, Saturation, and Errors for every resource, is an invaluable tool for debugging a specific machine. But I learned the hard way that using it directly for alerting on a distributed system is a recipe for noise. A thirty-second spike to 99% CPU on one of a dozen API servers is not a crisis. It's often just the garbage collector doing its job.

Paging someone for that is counterproductive. The operator wakes up, sees the system has already self-healed, and goes back to bed annoyed. After this happens three times, they start ignoring the alerts. The real problem is when that CPU spike is correlated with something else: rising latency, an increasing queue depth, or a jump in error codes. The single blip is noise. A pattern of correlated blips is a signal of systemic degradation. The goal is to build a system that can tell the difference.

[Figure: The Path to Alert Fatigue]

Monitoring Work, Not Just Machines

The fundamental shift I made was from monitoring machine state to monitoring the system's ability to perform its contracted work. This isn't a new idea; it’s a pragmatic application of the philosophy laid out in the "Monitoring Distributed Systems" chapter of Google's Site Reliability Engineering book. Its "Four Golden Signals" (Latency, Traffic, Errors, and Saturation) provide the intellectual foundation. For a solo operator, the key is to distill these into a handful of alerts that truly matter.

This means defining health in terms of outcomes:

For a web service: The key metrics are request latency (P99) and the rate of 5xx errors, each measured over a five-minute rolling window. The alert fires only if both degrade within the same window.
For a data pipeline: I monitor the "data watermark." How far behind the real-time wall clock is our processing? A brief spike is fine, but if the lag grows for ten consecutive minutes, it means our ingest is outpacing our processing capacity. That's a real problem.
For an LLM agent system: This is a new frontier where simple success/fail rates are not enough. I track the rate of runs ending in a "final failure" state, but also the semantic integrity of the output, the rate of tool-call hallucinations, and—critically—the token-cost-per-successful-task. A spike in token cost can be as dangerous as a jump in errors.
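To make the data pipeline rule concrete, here is a minimal sketch of a "lag has grown for ten consecutive minutes" check. It assumes one lag sample per minute and reads "grows" as strictly increasing at every step; the names LagSample and lag_growing_for are hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LagSample:
    minute: int         # minutes since an arbitrary start (assumed 1-minute cadence)
    lag_seconds: float  # wall-clock time minus the pipeline's data watermark


def lag_growing_for(samples: Sequence[LagSample], minutes: int = 10) -> bool:
    """True if lag rose at every one of the last `minutes` one-minute steps."""
    if len(samples) < minutes + 1:
        return False  # not enough history to judge a sustained trend
    recent = samples[-(minutes + 1):]
    return all(later.lag_seconds > earlier.lag_seconds
               for earlier, later in zip(recent, recent[1:]))
```

A single spike that recovers never produces ten rising steps in a row, so it stays silent. Only ingest that keeps outpacing processing trips the check.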

This approach asks "is the system doing its job?" instead of "is every component perfectly healthy?" The latter is an impossible and useless standard.
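For the LLM agent case, token-cost-per-successful-task is the least familiar of these metrics, so here is one possible way to compute it. This sketch assumes each run reports its token count and whether it succeeded, and it charges the tokens burned by failed runs to the successful ones. Both the AgentRun type and that accounting choice are my assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AgentRun:
    tokens_used: int
    succeeded: bool  # False means the run ended in a "final failure" state


def token_cost_per_successful_task(runs: Sequence[AgentRun]) -> float:
    """Total tokens spent (failed runs included) divided by successful runs."""
    successes = sum(1 for run in runs if run.succeeded)
    if successes == 0:
        return float("inf")  # all spend, no completed work
    return sum(run.tokens_used for run in runs) / successes
```

Counting failed runs in the numerator means a retry storm shows up as a cost spike even when the eventual success rate looks normal.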

Designing the Un-flappable Alert

Over the years, I've developed a small set of rules for creating alerts that I can trust. First, use windows, not instants. My alerts are almost all based on the average or rate of a metric over at least five minutes. This simple change smooths out the vast majority of transient spikes.
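As an illustration of "windows, not instants", here is a small sketch of an error rate computed over a trailing five-minute window. It assumes events arrive in time order with a timestamp in seconds; WindowedErrorRate is a hypothetical helper, and in practice a metrics backend would do this aggregation for you.

```python
from collections import deque


class WindowedErrorRate:
    """Fraction of failed requests over a trailing time window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, timestamp: float, is_error: bool) -> None:
        self._events.append((timestamp, is_error))

    def rate(self, now: float) -> float:
        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)
```

A five-second burst of errors inside five minutes of steady traffic barely moves this number, which is exactly the smoothing I want before anything is allowed to page me.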

Second, use compound conditions. An alert should require at least two correlated indicators of failure. A high message queue depth is only a problem if the rate of message consumption has also dropped. An alert that requires queue_depth > 10000 AND consumer_throughput_rate < 50/s is far more valuable than one that just checks the queue depth. Adopting a standard like the OpenTelemetry semantic conventions, which give metrics and attributes consistent names across services, makes creating these cross-domain rules much easier.
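The queue example can be sketched as a single predicate. The thresholds come from the rule above; the QueueStats type and the function name are hypothetical, and both inputs are assumed to be windowed values rather than instantaneous readings.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    queue_depth: int                 # messages waiting (windowed average)
    consumer_throughput_rate: float  # messages consumed per second (windowed)


def queue_backlog_alert(stats: QueueStats,
                        max_depth: int = 10_000,
                        min_throughput: float = 50.0) -> bool:
    """Fire only when the queue is deep AND consumers have slowed down."""
    return (stats.queue_depth > max_depth
            and stats.consumer_throughput_rate < min_throughput)
```

A deep queue that is draining at full speed stays quiet, and so does a slow consumer with nothing waiting. Only the combination fires.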

Third, have two distinct alert severities. A "critical" alert pages me. It means the system is failing for users *right now*. A "warning" alert creates a ticket or sends an email. It means a resource is approaching a dangerous threshold—like disk space hitting 80%—but the system is not yet failing. This respects my focus. Critical alerts are for fires; warnings are for fire hazards.
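A rough sketch of the two-severity split, assuming a boolean "users are failing right now" signal and a disk usage fraction as the only inputs. The 80% threshold is the one mentioned above; the function names and the "page" / "ticket" labels are placeholders for whatever paging and ticketing tools you actually use.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "critical"  # users are affected right now
    WARNING = "warning"    # a hazard, not yet a failure


def classify(user_facing_failure: bool,
             disk_used_fraction: float) -> Optional[Severity]:
    """Map current conditions to a severity, or None if all is quiet."""
    if user_facing_failure:
        return Severity.CRITICAL
    if disk_used_fraction >= 0.80:
        return Severity.WARNING
    return None


def route(severity: Optional[Severity]) -> str:
    """Decide where an alert goes: a page, a ticket, or nowhere."""
    if severity is Severity.CRITICAL:
        return "page"
    if severity is Severity.WARNING:
        return "ticket"
    return "none"
```

Keeping classification separate from routing makes it easy to change where warnings land without touching the definition of what counts as a fire.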

The Real Goal: Sustainable Operation

Building an observability stack as a solo operator is an exercise in ruthless prioritization. You have a finite budget of attention, and you cannot afford to spend it on false alarms. The goal isn't 100% coverage of every possible failure mode; it's 100% confidence in the alerts you do receive.

The trade-off is that this approach accepts a small amount of risk. By waiting for a failure to become systemic and sustained, I might miss the absolute earliest precursor. I accept that. The alternative—a constant stream of low-signal alerts—guarantees I will miss a real crisis because it will be lost in the noise. This philosophy forces a deeper understanding of the system's architecture. To build good alerts, you have to know what failure actually looks like, not just what a busy server looks like.

[Figure: A Calm Observability Architecture]

Concrete Takeaways

A well-designed telemetry system is calm. It doesn't chatter; it speaks only when it has something important to say. For a one-person platform, that silence is the sound of a system that is not only running, but is also understood.

Stop alerting on single, instantaneous metrics. Base all alerts on data aggregated over a time window of 5-10 minutes to filter out transient noise.
Monitor the work, not the machine. Focus on service-level indicators like error rates and P99 latency rather than low-level metrics like CPU load.
Build compound alerts that require multiple, correlated conditions to be true. A single indicator is often a false alarm; two or three related indicators point to a real problem.
Establish at least two alert severities. One that demands immediate action (a page) for user-facing failures, and another that creates a ticket for conditions that require future investigation.