When I realized tests were a gift to my future self

I remember staring at the dashboard, my stomach sinking. It wasn't a crash or a spike in 500 errors. It was something quieter and more insidious. An LLM agent, designed to summarize complex support tickets, was generating beautifully written, confident, and entirely incorrect summaries. It was hallucinating details about customer histories, and the support team had been acting on them for hours.

For years, I treated testing as a chore—a tax paid to get the CI pipeline green. The tests I wrote were often brittle, focused on implementation details, and felt like an anchor, not a sail. But this silent failure, born at the seam of a deterministic data pipeline and a non-deterministic agent, changed my perspective for good.

The Hallucination a Weekend Unraveled

The system was straightforward on paper. A data pipeline would pull a customer's history, package it into a clean JSON context, and feed it to the LLM agent. A developer, tasked with a small optimization, made a seemingly harmless change to the pipeline: a rarely used field was removed from the final JSON payload to save a few bytes. The code was deployed on a Friday afternoon.

The pipeline didn't fail; it just produced a slightly different, but still valid, JSON structure. The agent didn't fail; it simply compensated for the missing information by inventing it. The problem wasn't discovered until a senior support manager noticed a pattern of bizarre escalations. The rollback was easy, but rebuilding trust and cleaning up the bad decisions took the entire weekend.

A single, simple schema validation test would have prevented it all. A test asserting, "the context payload must contain field X," would have failed the build instantly. That's when the lesson landed: in a world with AI agents, tests aren't just about logic; they are about rigorously enforcing the contracts that protect non-deterministic components from garbage inputs.
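
As a minimal sketch of what such a contract test might look like, the example below checks a context payload against a list of required fields. The field names (`customer_id`, `tickets`, `account_tier`) and the `build_context` function are hypothetical stand-ins, since the real payload isn't shown here; a production version might use a schema library instead of the hand-rolled check.

```python
# Hypothetical contract: these field names are illustrative, not the real system's.
REQUIRED_FIELDS = {
    "customer_id": str,
    "tickets": list,
    "account_tier": str,  # stand-in for the rarely used field that was dropped
}


def validate_context(payload):
    """Return a list of contract violations; an empty list means the payload is valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}")
    return errors


def build_context(customer):
    """Hypothetical stand-in for the pipeline's final packaging step."""
    return {
        "customer_id": customer["id"],
        "tickets": customer.get("tickets", []),
        "account_tier": customer.get("tier", "standard"),
    }


def test_context_payload_satisfies_contract():
    payload = build_context({"id": "c-1", "tickets": [{"subject": "login"}]})
    assert validate_context(payload) == []


def test_missing_field_is_reported():
    payload = build_context({"id": "c-1"})
    del payload["account_tier"]  # simulate the Friday "optimization"
    assert validate_context(payload) == ["missing field: account_tier"]
```

The point of the second test is that the validator itself is exercised: a contract check that can never fail protects nothing.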

Executable Contracts for an AI World

The real shift in my thinking came when I had to refactor that same brittle data pipeline months later. I couldn't remember all the subtle edge cases, and the comments were sparse. Before I touched a single line of production code, I turned to a technique I'd learned from Michael Feathers' classic book, Working Effectively with Legacy Code. I wrote what he calls "characterization tests."

These tests don't assert what the code *should* do. They assert what it *currently* does, locking its behavior in place. I wrote tests like `test_produces_correct_json_for_type_b_accounts` and `test_handles_null_input_gracefully`. This suite became my safety net. More than that, it became executable documentation—a specification that could never go stale. When I write a good test today, I am doing a favor for the person I will be in six months. That future developer can make changes with confidence, knowing the test suite guards the system's explicit and implicit contracts.
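
To make the idea concrete, here is a small sketch of two characterization tests. The `legacy_build_payload` function is an invented stand-in for the pipeline step, including its quirk for type B accounts; the expected strings are the kind of value you would copy from the code's current output rather than derive from a spec.

```python
import json


def legacy_build_payload(account):
    """Hypothetical stand-in for the legacy pipeline step being characterized."""
    if account is None:
        return json.dumps({"account": None, "history": []})
    history = account.get("history", [])
    if account.get("type") == "B":
        history = history[-3:]  # quirk: type B accounts keep only the last three events
    return json.dumps({"account": account["id"], "history": history}, sort_keys=True)


def test_produces_correct_json_for_type_b_accounts():
    account = {"id": "a-7", "type": "B", "history": [1, 2, 3, 4, 5]}
    # Expected value recorded from the current behavior, quirk included.
    assert legacy_build_payload(account) == '{"account": "a-7", "history": [3, 4, 5]}'


def test_handles_null_input_gracefully():
    # Pins down what "gracefully" currently means: an empty payload, not an exception.
    assert legacy_build_payload(None) == '{"account": null, "history": []}'
```

If a refactor changes either string, the test fails and forces a decision: was that behavior a contract someone depends on, or an accident that can finally be fixed?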

Designing for Testability on the Data/AI Seam

This experience led me to a core conviction: if a piece of code is hard to test, it's almost always a sign that the code is poorly designed. This principle, popularized by Kent Beck and the Test-Driven Development community, is even more true in the data and AI space. The hardest part of our new systems to test is the non-deterministic agent itself. So how do we manage it?

We focus on the seams. We apply rigorous testing to everything that touches the agent. This thinking aligns with a durable pattern from software architecture: the "Test Pyramid," introduced by Mike Cohn and famously described by Martin Fowler. While end-to-end tests involving the LLM are slow, expensive, and sometimes unreliable, the unit and integration tests in the lower layers of the pyramid are our primary leverage point.

We can't write a simple test for "good summary," but we can write a thousand fast, reliable tests that verify:

- The data pipeline produces a schema-compliant, pristine context.
- The API client for the agent has correct retry and timeout logic.
- The functions the agent can call (its "tools") are themselves independently tested.
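
The second item on that list is a good example of how cheap these tests can be. The sketch below assumes a hypothetical `call_with_retry` helper and a `TransientError` exception type (neither comes from a real client library); the sleep function is injected so the test runs instantly and can inspect the backoff delays.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for retryable failures such as a timeout."""


def call_with_retry(send, request, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(request), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(request)
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


class FlakySend:
    """Test double that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, request):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientError("simulated timeout")
        return {"summary": "ok"}


def test_retries_then_succeeds():
    send, delays = FlakySend(failures=2), []
    result = call_with_retry(send, {"ticket": 1}, sleep=delays.append)
    assert result == {"summary": "ok"}
    assert send.calls == 3
    assert delays == [0.5, 1.0]  # backoff doubles between attempts


def test_gives_up_after_max_attempts():
    send = FlakySend(failures=5)
    try:
        call_with_retry(send, {"ticket": 1}, max_attempts=3, sleep=lambda seconds: None)
    except TransientError:
        pass
    else:
        raise AssertionError("expected TransientError after the final attempt")
    assert send.calls == 3
```

No network and no model are involved, so a suite of tests like these can run on every commit.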

Writing tests first forces us toward a better architecture of decoupled components with clean interfaces. These verifiable contracts are not a nice-to-have; they are essential for building reliable hybrid systems.
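
One way to picture such an interface, sketched under assumptions: the agent sits behind a small protocol, and the deterministic orchestration code is tested against a fake. The names `SummaryAgent`, `FakeAgent`, and `summarize_ticket` are invented for illustration and do not describe the original system.

```python
from typing import Protocol


class SummaryAgent(Protocol):
    """The seam: anything with this method can play the agent's role."""

    def summarize(self, context: dict) -> str: ...


class FakeAgent:
    """Deterministic test double that records every context it receives."""

    def __init__(self):
        self.seen = []

    def summarize(self, context: dict) -> str:
        self.seen.append(context)
        return f"summary for {context['customer_id']}"


def summarize_ticket(ticket: dict, agent: SummaryAgent) -> str:
    """Deterministic orchestration: build the context, guard it, then call the agent."""
    context = {"customer_id": ticket["customer_id"], "body": ticket["body"].strip()}
    if not context["body"]:
        raise ValueError("refusing to call the agent with an empty ticket body")
    return agent.summarize(context)


def test_agent_receives_clean_context():
    agent = FakeAgent()
    result = summarize_ticket({"customer_id": "c-1", "body": "  cannot log in \n"}, agent)
    assert agent.seen == [{"customer_id": "c-1", "body": "cannot log in"}]
    assert result == "summary for c-1"


def test_empty_body_never_reaches_agent():
    agent = FakeAgent()
    try:
        summarize_ticket({"customer_id": "c-1", "body": "   "}, agent)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an empty body")
    assert agent.seen == []
```

The real LLM client would satisfy the same protocol, so everything around it stays testable without ever asking a test to judge what a "good summary" is.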

The Durable Patterns Still Win

The hype cycle pushes new tools and "agentic frameworks" daily. Yet the real work of building robust systems still relies on principles of durability and craftsmanship. The upfront investment in testing feels slower, but the payoff comes when a CI pipeline catches a contract regression at 3 p.m. instead of a customer finding a hallucination at 3 a.m.

I stopped seeing tests as a chore and started seeing them as one of the most professional, high-leverage activities in modern system architecture. They aren't for a coverage metric. They are a deeply practical act of kindness to your future self and the only sane way to build systems that compose deterministic automation with agentic work. Testability is a signal of good design, and in this new world, it is our most powerful tool for managing chaos.