The migration that moved millions of rows without losing one

There’s a unique kind of silence that falls over a team before they trigger a major data migration. It’s the silence of calculated risk. Years ago, I was the architect for a migration of a high-volume system of record. The mandate was absolute: move everything to a new database architecture without losing a single record or suffering more than a few minutes of planned downtime. The team pulled it off. The secret wasn't a magic tool; it was a deep, operational respect for what could go wrong.

Paranoia as a Design Principle

Most migrations for smaller services can get by with a simple "stop-the-world" approach—take the system offline, copy the data, and point the app at the new database. It’s fast. But for a core system where downtime costs real money, that kind of big-bang cutover is a non-starter. The plan I laid out for the team was built on a foundation of paranoia: run everything in parallel, validate constantly, and make the final cutover the most boring part of the project.

The Phased Migration Process

The core of the strategy was to run the old and new systems in parallel for weeks. This is a classic technique, a variant of what Martin Fowler calls the Strangler Fig Application pattern. Instead of a single, high-risk event, I designed the migration as a slow, controlled process. For a month before cutover, we modified the application to write every transaction to both the old legacy database and the new one. The legacy system remained the source of truth, but this dual-write phase gave us a live, continuously updated copy of the data in the new environment to test and validate against.
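A minimal sketch of the dual-write phase described above, in Python. The `Store` interface, the `DictStore` stand-in, and the choice to log a failed shadow write instead of failing the request are all assumptions made for illustration, not details of the original system.

```python
import logging
from typing import Any, Protocol

log = logging.getLogger("dual_write")


class Store(Protocol):
    """Hypothetical minimal interface shared by both databases."""

    def put(self, key: str, record: dict[str, Any]) -> None: ...


class DictStore:
    """In-memory stand-in so the sketch runs without a real database."""

    def __init__(self) -> None:
        self.rows: dict[str, dict[str, Any]] = {}

    def put(self, key: str, record: dict[str, Any]) -> None:
        self.rows[key] = dict(record)


class DualWriter:
    """Writes to the source of truth first, then mirrors to the shadow store."""

    def __init__(self, primary: Store, shadow: Store) -> None:
        self.primary = primary  # legacy database during the dual-write phase
        self.shadow = shadow    # new database, not yet authoritative
        self.shadow_failures: list[str] = []

    def put(self, key: str, record: dict[str, Any]) -> None:
        # A failure here propagates: the source of truth must accept the write.
        self.primary.put(key, record)
        try:
            self.shadow.put(key, record)
        except Exception:
            # Assumption: a shadow failure must not break production traffic.
            # The key is remembered so reconciliation can repair it later.
            log.exception("shadow write failed for key %s", key)
            self.shadow_failures.append(key)
```

The design choice worth noting is the asymmetry: only the authoritative store can fail a request, while drift in the copy is left for the validation jobs to catch and repair.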

Continuous Validation is Non-Negotiable

How do you prove the new data is identical to the old? Counting rows isn't enough. A single flipped bit in a serialized blob can cause silent, critical corruption. Our validation ran on two levels: aggregate and row-level.

Aggregate Checksums: Every hour, an automated job calculated a checksum across key columns for ranges of records in both databases. This was our fast, cheap, early-warning system. A mismatch was the first sign of drift.
Row-Level Hashing: A slower, more intensive process ran in the background, hashing entire record payloads in both systems and comparing them. Any mismatch triggered an immediate alert for investigation.
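The two levels above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original jobs: tables are modeled as in-memory mappings keyed by an integer primary key, SHA-256 over canonical JSON is one possible hash, and every function name here is hypothetical.

```python
import hashlib
import json
from typing import Any, Mapping, Sequence

Row = Mapping[str, Any]
Table = Mapping[int, Row]  # primary key -> row; stand-in for a real range query


def row_hash(row: Row) -> str:
    """Hash a whole record using a canonical (sorted-key) JSON encoding."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def range_checksum(table: Table, lo: int, hi: int, columns: Sequence[str]) -> str:
    """Cheap aggregate: one digest over selected columns for keys in [lo, hi)."""
    digest = hashlib.sha256()
    for key in sorted(k for k in table if lo <= k < hi):
        projected = {c: table[key].get(c) for c in columns}
        digest.update(f"{key}:{row_hash(projected)};".encode("utf-8"))
    return digest.hexdigest()


def drifted_ranges(
    old: Table, new: Table, lo: int, hi: int, step: int, columns: Sequence[str]
) -> list[tuple[int, int]]:
    """Aggregate pass: return the key ranges whose checksums disagree."""
    out = []
    for start in range(lo, hi, step):
        end = min(start + step, hi)
        if range_checksum(old, start, end, columns) != range_checksum(
            new, start, end, columns
        ):
            out.append((start, end))
    return out


def mismatched_rows(old: Table, new: Table) -> list[int]:
    """Row-level pass: keys that are missing on one side or hash differently."""
    keys = set(old) | set(new)
    return sorted(
        k
        for k in keys
        if k not in old or k not in new or row_hash(old[k]) != row_hash(new[k])
    )
```

In this sketch the aggregate pass only sees the columns it is given, so a change confined to another column goes unnoticed until the row-level pass runs. That gap is the reason to run both.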

This wasn't a one-time check. It was a continuous reconciliation process that ran from the moment dual-writes were enabled until the old system was decommissioned, weeks after the cutover. This is the unglamorous, essential work of what is now called Database Reliability Engineering. It's computationally expensive, but far cheaper than a data corruption incident.

The Slow Cutover by Degrees

The "big bang" is a myth for any system that matters. We never had a single moment where we flipped a global switch. Instead, using a feature flag system, we started routing a tiny fraction of read requests to the new database. The application was built to "read from new, fall back to old." If the new database returned an error or timed out, the code would automatically retry the read against the legacy system.
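A rough sketch of that read path, assuming the feature flag exposes a single integer percentage. Hashing the key into a stable bucket is one possible way to pick the fraction (random sampling per request would also work). The names `bucket_for` and `read_with_fallback` are hypothetical.

```python
import hashlib
import logging
from typing import Callable

log = logging.getLogger("read_router")

Reader = Callable[[str], object]


def bucket_for(key: str) -> int:
    """Map a key to a stable bucket in [0, 100) so routing is repeatable."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100


def read_with_fallback(
    key: str, new_read: Reader, old_read: Reader, percent: int
) -> object:
    """Read from the new store for `percent` of keys, falling back to the old."""
    if bucket_for(key) < percent:
        try:
            return new_read(key)
        except Exception:
            # Covers errors and timeouts raised by the client. Real code would
            # likely narrow this to the driver's own exception types.
            log.warning("new store read failed for %s; falling back", key)
    return old_read(key)
```

With `percent` at 0 every read goes to the legacy store, and at 100 the legacy store is touched only when the new one fails, which is what lets the percentage move in either direction without a deploy.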

We watched the dashboards: error rates, latency, checksum mismatches. Everything was quiet. So we gradually widened the aperture over the course of a week, increasing the percentage of traffic step by step. Each increase was a deliberate decision based on data. By the time we promoted the new system to be the source of truth for writes, it had already been handling the full production read load for days. That final cutover was an anticlimax. That’s exactly what you want.
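The "decision based on data" can be made explicit as a small gate. The step schedule and thresholds below are invented for illustration, and the sketch only recommends a next percentage. It is not a record of the values or tooling the team used.

```python
from dataclasses import dataclass

# Hypothetical ramp schedule; the original percentages are not stated.
STEPS = (1, 5, 10, 25, 50, 100)


@dataclass(frozen=True)
class Health:
    """Dashboard signals observed at the current traffic percentage."""

    error_rate: float          # fraction of failed reads, e.g. 0.0005
    p99_latency_ms: float
    checksum_mismatches: int


def next_percent(
    current: int,
    health: Health,
    max_error_rate: float = 0.001,  # assumed threshold
    max_p99_ms: float = 250.0,      # assumed threshold
) -> int:
    """Recommend the next read percentage; hold steady if any signal is bad."""
    healthy = (
        health.error_rate <= max_error_rate
        and health.p99_latency_ms <= max_p99_ms
        and health.checksum_mismatches == 0
    )
    if not healthy:
        return current  # hold; rolling back is a separate, explicit decision
    for step in STEPS:
        if step > current:
            return step
    return current  # already at the final step
```

Any checksum mismatch blocks the ramp in this sketch regardless of how good latency and error rates look, since drift is the one signal that does not heal on its own.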

The Rollback Plan Is the Real Plan

What if, at fifty percent read traffic, we discovered a subtle data corruption bug? This is the moment your migration lives or dies. Saying "roll back" is easy while the legacy system is still the source of truth. Once the new database has been promoted for writes, it holds the most current data, and an old system that has stopped receiving writes is stale.

Our rollback plan was a detailed, tested procedure, not a vague intention. Because we kept dual-writes running (to the new system as primary, old as secondary), the legacy database was still receiving a live feed. A rollback meant switching 100% of reads back to the old system immediately via the feature flag, running a reconciliation script to replay any missing writes from the new DB into the old, and only then disabling the new system. We never had to use it. But because we had it, we could proceed with confidence. The ability to safely undo a step is what allows you to take the step in the first place.
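The three rollback steps above can be sketched as one function. This is a toy model under loud assumptions: both stores are plain dictionaries, each row carries a hypothetical monotonically increasing `version` field used to detect missing writes, the flag names are invented, and deletes are ignored.

```python
from typing import Any

Row = dict[str, Any]


def missing_writes(new_rows: dict[str, Row], old_rows: dict[str, Row]) -> dict[str, Row]:
    """Rows the new store has that the old store lacks or holds at an older version."""
    return {
        key: row
        for key, row in new_rows.items()
        if key not in old_rows or old_rows[key]["version"] < row["version"]
    }


def roll_back(
    flags: dict[str, int], new_rows: dict[str, Row], old_rows: dict[str, Row]
) -> int:
    """Run the rollback steps in order and return how many rows were replayed."""
    # Step 1: send all reads back to the legacy system immediately.
    flags["read_from_new_percent"] = 0

    # Step 2: replay writes the legacy system missed.
    replay = missing_writes(new_rows, old_rows)
    for key, row in replay.items():
        old_rows[key] = dict(row)

    # Step 3: disable the new system only once the legacy copy is caught up.
    if missing_writes(new_rows, old_rows):
        raise RuntimeError("legacy store still behind; keep the new system enabled")
    flags["new_db_enabled"] = 0
    return len(replay)
```

The ordering is the point: reads move first because that is instant and safe, and the new system is switched off last because it is the only place the missing writes exist until the replay has been verified.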

[Figure: Parallel Run Migration Architecture]

Lessons That Last

A zero-data-loss migration is a grind. It’s a testament to architectural humility and operational discipline. The demos never show the weeks of running parallel systems or the tedious work of reconciling checksums at 3 AM. For any high-stakes migration, these are the patterns that hold up:

Migrate state last. Move application logic, then shift read traffic, and only at the very end, change the source of truth for writes. The thing that is hardest to change back should be changed last.
Your validation is your safety net. Don't just check it once. Build a system of continuous, automated verification that runs before, during, and long after the migration is "done."
Plan the rollback first. A migration plan without a tested, detailed rollback procedure is just a wish list. It deserves the same engineering rigor as the migration itself.

In my experience, the most successful high-stakes project is the one that feels boring on launch day. The lack of drama is the surest sign of craftsmanship.