Refactoring a module I was scared to touch
How to safely refactor a legacy module. A case study on using characterization tests to make a deterministic pipeline reliable enough for a modern AI agent.
There was a module in a system I worked on that we called "the dragon." It was a critical data processing pipeline that had run, untouched, for nearly a decade. The original authors were long gone and the code was a thousand-line monolith of tangled dependencies. The rule was simple: don't touch the dragon.
That rule held until a new requirement emerged: an LLM-based agent needed to use the dragon's output as a trusted source of truth. The agent's work was generative and probabilistic, so it absolutely required its inputs to be stable and deterministic. Suddenly, the dragon's undocumented, brittle nature went from being a technical debt problem to an architectural blocker. My job was to operate on the patient without it bleeding out on the table.
The Real Risk: Brittle Determinism
The fear surrounding this module wasn't just about its complexity; it was about its brittleness. In my experience, a change in one part of a system like this can cause an unexpected failure in a seemingly unrelated part. A full rewrite was tempting, but the module was a core dependency for half a dozen services. A big-bang replacement would have been a months-long project with an unacceptably high risk of breaking production in subtle ways.
Another option was to use an approach like the Strangler Fig pattern, which Martin Fowler describes as a way to incrementally replace a system by routing calls through a new facade. This is an excellent pattern for migrating services. But in this case, the dragon was a library, not a service, and my primary goal wasn't just to replace it—it was to make the existing, proven logic understandable and reliable so the new agentic system could depend on it. That required going inside. The strategy had to be containment, not demolition.
A Safety Harness of Characterization Tests
Before I could change a single line of code, I needed a safety net. I couldn't write traditional unit tests because I didn't know what the "correct" logic was supposed to be. The existing behavior, bugs and all, was the de facto specification. The solution comes directly from Michael Feathers' canonical book, Working Effectively with Legacy Code: characterization tests.
The process is disciplined but straightforward. First, I logged a week's worth of real-world inputs being fed into the module in production. Then, for each input, I recorded the exact output it produced—the return value, any database changes, every side effect. Finally, I built a test suite that did nothing more than feed each captured input into the module and assert that the new output was identical to the recorded one. This test suite didn't prove the module was correct. It proved the module behaved exactly as it did yesterday. With this tripwire in place, I could begin surgery.
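A minimal sketch of such a harness, in Python, might look like the following. Everything named here is an assumption for illustration: legacy_process is a tiny hypothetical stand-in for the real module, and the captured inputs are invented. The sketch also compares return values only; database changes and other side effects would need their own capture mechanism.

```python
import copy
import json


def legacy_process(record: dict) -> dict:
    """Hypothetical stand-in for the legacy module under test."""
    total = sum(record.get("amounts", []))
    # A quirk the tests should preserve: negative totals are clamped to zero.
    return {"id": record["id"], "total": max(total, 0)}


def normalize(value):
    """Round-trip through JSON so live results compare equal to results loaded from disk."""
    return json.loads(json.dumps(value, sort_keys=True))


def record_golden(fn, captured_inputs):
    """Run each captured input through fn and keep the input/output pair."""
    golden = []
    for item in captured_inputs:
        # Deep-copy so a function that mutates its argument cannot corrupt the recording.
        output = fn(copy.deepcopy(item))
        golden.append({"input": normalize(item), "expected": normalize(output)})
    return golden


def find_mismatches(fn, golden):
    """Replay every recorded input and collect the cases whose output has changed."""
    mismatches = []
    for case in golden:
        actual = normalize(fn(copy.deepcopy(case["input"])))
        if actual != case["expected"]:
            mismatches.append(
                {"input": case["input"], "expected": case["expected"], "actual": actual}
            )
    return mismatches


if __name__ == "__main__":
    # Invented sample inputs; in practice these would come from production logs.
    captured = [{"id": 1, "amounts": [10, 5]}, {"id": 2, "amounts": [-20, 5]}]
    golden = record_golden(legacy_process, captured)
    assert find_mismatches(legacy_process, golden) == []
    print(f"{len(golden)} recorded cases still match")
```

Note that the recording deliberately captures the clamping quirk rather than "fixing" it. The harness has no opinion about correctness; it only reports when behavior drifts from what was recorded.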
The First Incision: Extracting Pure Functions
The first step should always be the smallest, most certain win. I scanned the thousand-line function and found a 50-line block that performed a pure data transformation. It took a data structure, applied a few rules, and returned a new one. It had no side effects. It was a perfect candidate for extraction.
My workflow was methodical. I carefully copied that block into a new, separate function. Then I wrote a handful of focused unit tests for just that new function, giving a piece of this logic a clear specification for the first time. I replaced the original 50-line block in the main module with a single call to my new, tested function. Finally, I ran the entire characterization test suite. It passed. The overall behavior hadn't changed, but a small piece of the dragon was now isolated, understood, and protected.
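To make the extraction step concrete, here is a small before-and-after sketch in Python. The functions and the amount-cleaning rule are hypothetical, chosen only to show the shape of the change; the real block was around 50 lines and did something different.

```python
def process_before(order: dict) -> dict:
    """Original shape: the transformation is buried inline in the larger function."""
    cleaned = []
    for amount in order["amounts"]:
        if amount is None:
            continue
        cleaned.append(round(float(amount), 2))
    return {"id": order["id"], "amounts": cleaned, "total": round(sum(cleaned), 2)}


def clean_amounts(amounts: list) -> list:
    """Extracted pure function: drops missing values and rounds the rest to cents."""
    return [round(float(amount), 2) for amount in amounts if amount is not None]


def process_after(order: dict) -> dict:
    """Same behavior, but the inline block is now a single call."""
    cleaned = clean_amounts(order["amounts"])
    return {"id": order["id"], "amounts": cleaned, "total": round(sum(cleaned), 2)}


if __name__ == "__main__":
    # Focused unit test: the extracted function gets its own small specification.
    assert clean_amounts([1, None, "2.5"]) == [1.0, 2.5]

    # Characterization check: old and new versions agree on sample inputs.
    samples = [{"id": 1, "amounts": [1, None, "2.5"]}, {"id": 2, "amounts": []}]
    for sample in samples:
        assert process_before(sample) == process_after(sample)
    print("extraction preserved behavior")
```

In a real refactor the old version is not kept around in the code; the recorded characterization cases play the role that process_before plays in this sketch.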
From Opaque Box to Composable Pipeline
What followed was two weeks of patient work, repeating that pattern. I would identify another semi-independent chunk of logic—a validation routine, a data enrichment step—and extract it into its own tested function. Each time, I'd replace the original code and run the full characterization suite to ensure nothing had broken. This method isn't fast, but its power is in its safety and momentum.
With every successful extraction, the original monolithic function got smaller and easier to read. The suite of specific unit tests grew, increasing my confidence. Gradually, the structure of the beast revealed itself. What looked like one monstrous task was actually a dozen smaller, well-defined responsibilities fused together. Separating them made the system's behavior easier to reason about and verify, and far less frightening.
After three weeks, the work was done. The original thousand-line function was now a 100-line coordinator calling out to two dozen small, well-named, and independently tested helper functions. The "black box" was now a clean, composable pipeline whose outputs I could guarantee. The new LLM agent had its reliable, deterministic foundation.
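The end state can be sketched as a thin coordinator that runs a fixed sequence of small helpers. This Python sketch is illustrative only: the three step names and their rules are assumptions, and the real coordinator called about two dozen helpers rather than three.

```python
def validate(record: dict) -> dict:
    """Reject records that the later steps cannot handle."""
    if "id" not in record:
        raise ValueError("record is missing 'id'")
    return record


def enrich(record: dict) -> dict:
    """Fill in a default currency without mutating the input."""
    return {**record, "currency": record.get("currency", "USD")}


def summarize(record: dict) -> dict:
    """Reduce the record to the fields downstream consumers read."""
    return {
        "id": record["id"],
        "currency": record["currency"],
        "total": sum(record.get("amounts", [])),
    }


# The coordinator is just an ordered list of steps, each testable on its own.
PIPELINE = (validate, enrich, summarize)


def run_pipeline(record: dict) -> dict:
    """Pass the record through every step in order and return the final result."""
    result = record
    for step in PIPELINE:
        result = step(result)
    return result


if __name__ == "__main__":
    print(run_pipeline({"id": 7, "amounts": [1, 2]}))
```

Because each step here is a pure function of its input, running the pipeline twice on the same record yields the same result, which is the property the downstream agent depends on.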
Concrete Takeaways
The most intimidating systems are often just simple systems tangled together. Making them safe for a world of modern software, data, and AI is often more about archaeology than greenfield invention. Here is what I learned from the process.
- Map Your Architecture: Identify which parts of your system are deterministic (and must be reliable) and which are agentic (and can be probabilistic). The quality of the latter often depends entirely on the quality of the former.
- Build a Safety Net First: Before you refactor, build a harness of characterization tests to lock down existing behavior. Without it, you are flying blind. The technique comes from Michael Feathers' Working Effectively with Legacy Code.
- Extract, Don't Rewrite: Isolate small, pure functions one at a time. Each extraction is a small victory that reduces complexity and builds momentum safely. The goal is to untangle, not to detonate.
- Value the Boring Work: This work isn't glamorous. It doesn't produce flashy demos. But turning a brittle, feared component into a durable, understood asset is one of the highest-value activities in engineering. It's the work that lasts.