The night a region went dark: a disaster-recovery story

Software

A first-person disaster recovery story about a full cloud region failure. It explores how plans fail under human pressure and technical debt like config drift.

The first sign of a true disaster isn't an alarm. It's the silence that follows. The moment when a hundred frantic alerts suddenly stop, replaced by a void where an entire system used to be. My phone lit up in the dead of night, not with a single failure, but with the coordinated, deafening silence of a whole cloud region going dark.

Denial is the first stage of any outage. My initial thought wasn't that the region was gone, but that our monitoring had to be broken. It's a comforting lie. The reality, confirmed moments later on a public status page, was far worse. We were completely blind. The plan we had on paper was about to meet the friction of a real-world crisis.

The Brittle Blueprint

Every enterprise has a disaster recovery plan. Ours was a classic active-passive design, a pattern that has been sound for decades. A primary region served live traffic, while a secondary, geographically distant region kept a warm standby, receiving a constant stream of replicated data. In theory, a failure meant promoting the secondary systems and executing a DNS change to redirect traffic. It was a clean, logical flow.

[Figure: The Textbook Failover Plan. Primary region serves traffic → data replicates to standby → failure is detected → manual DNS switch to standby.]

We had runbooks, checklists, and even tabletop exercises. But these artifacts all share a fatal flaw: they assume a calm, rational operator executing precise steps with perfect information. They don't account for the stress, the confusion, and the sheer bad luck of a crisis in motion.

First Contact with Friction

Our runbook began with "Verify the outage," but how do you verify a total absence of signal? Precious minutes bled away as we debated whether to trigger a failover that would be incredibly disruptive if we were wrong. We had lost our eyes and ears, and the first step of the plan was already a judgment call under extreme pressure.
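In hindsight, "verify the outage" could have been a rule rather than a debate. Here is a minimal sketch of the idea, assuming probes run from several independent vantage points outside the primary region. The vantage point names and the 75% quorum are made up for illustration; they are not what we ran that night.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    vantage_point: str  # hypothetical name of where the probe ran
    reachable: bool     # did the primary region answer this probe?


def region_is_down(results: list[ProbeResult], quorum: float = 0.75) -> bool:
    """Return True when at least `quorum` of the probes could not reach the region."""
    if not results:
        # No probe data is not evidence of an outage: our own monitoring may be broken.
        return False
    failed = sum(1 for r in results if not r.reachable)
    return failed / len(results) >= quorum


# Illustrative probe results, not real data.
probes = [
    ProbeResult("office-network", False),
    ProbeResult("secondary-region", False),
    ProbeResult("third-party-checker", False),
    ProbeResult("mobile-carrier", True),
]
print(region_is_down(probes))  # 3 of 4 probes failed, which meets the 0.75 quorum: True
```

The point of the quorum is to separate "the region is gone" from "one of our monitors is broken", which is exactly the question we spent those minutes arguing about.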

The next step, the DNS failover, was blocked by human reality. The on-call engineer with the necessary credentials was on a flight, unreachable. The fallback was a credential vault that required multi-factor approval from two senior leaders. A crucial half-hour was lost just getting the right people onto a conference bridge, coherent enough to complete the unlock procedure. All the while, customers saw nothing but errors.

The Ghosts of Deployments Past

When we finally flipped the switch, the secondary region lit up, then immediately began to buckle. Our clean failover became a chaotic scramble as a series of hidden weaknesses cascaded into view.

The first ghost was configuration drift. Months earlier, a team had scaled up a service in the primary region using the web console, a quick fix that was never codified. Our infrastructure-as-code said both regions were identical, but they were not. The failover traffic slammed into an underscaled service.
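Drift like this is cheap to detect if you compare what the code declares against what is actually running. The sketch below assumes both sides can be exported as plain dictionaries keyed by service name; the `checkout` service and its instance counts are hypothetical, and a real check would read from your infrastructure-as-code state and your cloud provider's API.

```python
def find_drift(declared: dict[str, dict], observed: dict[str, dict]) -> list[str]:
    """List every difference between declared settings and observed settings."""
    findings = []
    for service in sorted(declared.keys() | observed.keys()):
        want = declared.get(service)
        have = observed.get(service)
        if want is None:
            findings.append(f"{service}: running but not declared in code")
        elif have is None:
            findings.append(f"{service}: declared in code but not running")
        else:
            for key in sorted(want.keys() | have.keys()):
                if want.get(key) != have.get(key):
                    findings.append(
                        f"{service}.{key}: declared={want.get(key)!r} "
                        f"observed={have.get(key)!r}"
                    )
    return findings


# Hypothetical example: the code says 4 instances, the console change made it 12.
declared = {"checkout": {"instances": 4}}
observed_primary = {"checkout": {"instances": 12}}
print(find_drift(declared, observed_primary))
# ['checkout.instances: declared=4 observed=12']
```

The same function works for comparing the primary region against the standby directly, which is the comparison that would have caught our underscaled service.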

The second was replication lag. Our asynchronous replication, usually only seconds behind, had spiked just before the outage, so we had to accept a painful window of data loss. The third was the sin of hardcoded endpoints: a forgotten internal tool pointing to a regional IP, a third-party webhook aimed at a specific load balancer. Our supposedly region-agnostic system was tied to its primary home in a dozen small, brittle ways.
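The data-loss window is simple arithmetic once you have two timestamps: when the primary stopped answering, and the commit time of the last transaction the standby applied. A minimal sketch follows, assuming both are available as timezone-aware datetimes. The timestamps and the 60-second recovery point objective (RPO) are illustrative values, not our real numbers.

```python
from datetime import datetime, timedelta, timezone


def data_loss_upper_bound(outage_started_at: datetime, last_applied_at: datetime) -> timedelta:
    """Upper bound on lost writes: everything committed after the standby's last applied change."""
    return max(outage_started_at - last_applied_at, timedelta(0))


def within_rpo(window: timedelta, rpo: timedelta) -> bool:
    """True when the potential data loss fits inside the recovery point objective."""
    return window <= rpo


# Illustrative timestamps only.
outage_started_at = datetime(2024, 1, 1, 2, 14, 30, tzinfo=timezone.utc)
last_applied_at = datetime(2024, 1, 1, 2, 11, 0, tzinfo=timezone.utc)

window = data_loss_upper_bound(outage_started_at, last_applied_at)
print(window)                                     # 0:03:30
print(within_rpo(window, timedelta(seconds=60)))  # False
```

Tracking this number continuously, and alerting when it exceeds the RPO, turns a lag spike into a known risk before the outage rather than a surprise after it.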

From Recovery to True Resilience

We recovered, but the experience fundamentally changed my view on resilience. It's not about having a plan to recover; it's about building a system that is continuously resilient by design. The lessons were etched in scar tissue.

First, automate the *decision* to fail over, not just the steps. The most fallible part of our process was the manual gatekeeping under duress. True resilience means the system is designed to route around failure automatically. As the pioneering work of thinkers like John Allspaw on cognitive systems engineering highlights, automation's best role is to support human expertise during a crisis, not demand flawless procedural execution.
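One way to picture an automated decision is a small rule that only fires after several consecutive failed health checks, so a single blip never triggers a failover. This is a sketch under my own assumptions: the `FailoverDecider` class and the threshold of three are hypothetical, and a production version would also weigh signals like replication lag and standby capacity.

```python
from collections import deque


class FailoverDecider:
    """Recommend failover only after N consecutive failed health checks."""

    def __init__(self, failures_required: int = 3):
        # Keep only the most recent N results; older ones fall off automatically.
        self._recent = deque(maxlen=failures_required)

    def observe(self, primary_healthy: bool) -> bool:
        """Record one health check and return True if failover should start."""
        self._recent.append(primary_healthy)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and not any(self._recent)


decider = FailoverDecider(failures_required=3)
decisions = [decider.observe(healthy) for healthy in [True, False, False, False]]
print(decisions)  # [False, False, False, True]
```

The rule is deliberately boring. Its value is that nobody has to find credentials or win an argument at 3 a.m. before it acts, and the humans on the bridge can spend their attention on what the automation cannot see.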

Second, your failover environment is a production environment. Full stop. It must get every deployment, every configuration change, every capacity adjustment in lockstep with the primary. It is not a spare tire; it is a parallel universe, ready to become the real one at any moment.

This is where modern practices part ways with simple DR. Chaos Engineering is excellent for finding technical weak spots, but it often misses the procedural and human paralysis we experienced. A resilient system isn't just one that survives a server failure; it's one where the architecture itself removes the need for a hero with the right credentials in the middle of the night.

[Figure: A Modern Resilient Architecture.
Ingress & Edge: Global Load Balancer, DDoS Protection, Web Application Firewall.
Regional Stacks (Active-Active): Stateless Services, Distributed Data Stores, Agentic Runtimes, Deterministic Pipelines.
Replicated State: Object Storage, Managed Databases, Vector Stores, Event Streams.
Observability & Control Plane: Global Metrics, Health Checks, Automated Failover Logic, Incident Management.]

Durability Over Demos

That night taught me that the most critical systems aren't the flashy new agentic models or complex data pipelines. They are the boring, foundational layers of load balancing, data replication, and configuration management. In an era where a single AI-driven feature can depend on a dozen microservices, the blast radius of a foundational failure is larger than ever.

You can't build a durable, modern software practice on a brittle foundation. The hard-won lessons from a region going dark are what allow everything else to function. A great system is one where the recovery plan is so deeply embedded in the architecture that it rarely needs to be read.

Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.