jcardena.com Blog The release that taught me to fear Fridays
145 posts
EN ES

The release that taught me to fear Fridays

Web

A personal story of a weekend-long outage caused by a 'simple' Friday deployment. It reveals hard-won architectural rules for managing stateful changes.

The pull request was approved. All tests were green. Someone had spoken the dangerous words, "it's just a small schema change," and on a Friday afternoon, they felt true. The deployment pipeline lit up, and ten minutes later, everything looked fine. That quiet satisfaction lasted until my phone rang at 7:00 AM on Saturday.

It was the release that taught me to truly fear Fridays.

The release that taught me to fear Fridays
The release that taught me to fear Fridays

The anatomy of a simple change

The system was a core data platform that processed events, enriched them, and stored them in a central database. This table fed dozens of downstream consumers. The business needed a new attribute tracked, which meant adding one non-nullable column with a default value to our largest table. On paper, it was textbook.

I had a migration script to add the column and a backfill script to populate the default value for a massive number of existing rows. The application code was updated to write the new field. Staging tests were perfect. The mistake was not in the code, but in my understanding of the system's coupling. I treated the deployment as a single, atomic event: ship the code change and the schema migration together. It was efficient. It was also a time bomb.

The release that taught me to fear Fridays
The release that taught me to fear Fridays
Schema ChangeDeployedBackfill ScriptFailsDownstreamPipeline CrashesReporting APIServes Stale Data
The Tightly-Coupled Cascade

The slow cascade of failure

The first alert came from the team managing the primary analytics pipeline. Their jobs were failing with a cryptic database error. Soon after, the reporting API that fed executive dashboards turned unhealthy, serving stale data. The root cause was maddeningly simple. My backfill script, which worked on our well-behaved staging data, failed on a tiny fraction of legacy rows, leaving NULLs where the schema now forbade them.

The application's write path was happily inserting new rows correctly. But any process that tried to read the older, un-backfilled rows hit a poison pill and crashed. The analytics pipeline was the first to find them. Its failure starved the reporting API of fresh data, which in turn failed its health checks. The system was now thrashing, caught in a failure loop caused by a handful of bad rows.

Why we could not just roll back

"Just roll it back" is easy to say. In this case, it was nearly impossible. We had committed a cardinal sin: mixing a code deployment with an irreversible state change. I could roll back the application code in minutes, but the database schema was already altered. New, valid data written since Friday afternoon was already in the new format.

Rolling back the schema would mean either losing that new data or writing a complex "down" migration on a Saturday morning with the entire business watching. That is how you turn a major incident into a catastrophe. The weekend became a 48-hour firefight of writing new scripts, carefully running them on production, and manually restarting the entire chain of services in the correct sequence.

The same risk in agentic systems

This war story is more relevant than ever. The same risk is present in the modern AI stack, but it's often more subtle. Think of a RAG system where your "database" is a vector store and your "schema" is the combination of an embedding model and the chunking logic applied to your source documents.

A "simple" change to an upstream data cleaning pipeline—like a new Unicode normalization rule—is the 2026 equivalent of my NOT NULL column. If this change causes a slight drift in how new documents are embedded, the failure won't be a clean crash. The agent will just start producing bizarre, subtly incorrect answers. It will "hallucinate" with confidence, poisoned by bad data from the vector store. The feedback loop is longer, the failure mode is non-deterministic, and the impact on user trust can be devastating.

SOURCESSource DocumentsUser EventsThird-party APIsINGESTION & PROCESSINGDeterministic ETLEmbedding Modelv1.1Vector StoreLLM Agent LogicSERVING & OBSERVABILITYRAG APIAnalyticsMonitoringDashboards
Modern Agentic System Failure Points

My hard rules for touching production

That weekend fire drill was expensive, but the lessons were priceless. They became a set of non-negotiable rules I've carried with me ever since, whether dealing with a PostgreSQL database or a vector store for an LLM agent.

  • Code and state are deployed separately. This is a practical application of the patterns described in foundational work like Martin Fowler's Evolutionary Database Design. Deploy backward-compatible code first. Then, in a separate, monitored process, perform the state change. Finally, deploy code that makes the new state the default. This decouples the points of failure.
  • A change is not ready until its rollback is. If you cannot articulate, automate, and test the process for undoing a change, you are not ready to deploy it. "We'll figure it out if it breaks" is a recipe for a weekend-long outage.
  • Respect the clock and the calendar. I know this runs counter to some modern DevOps thinking. Data from sources like the DORA State of DevOps Report consistently shows that elite performers can deploy safely any time. But that capability is often built for stateless services. For irreversible, stateful changes to a core data model, the calculus is different. A stable, boring Tuesday gives you the entire team, fresh and focused, to respond if things go wrong.

The goal is not just to ship features, but to operate durable systems. Sometimes the most productive thing you can do on a Friday afternoon is close your laptop.

JC
Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.