
The day data quality became my whole job

Data

A personal story of a silent data failure that reshaped my view on architecture. From a simple ETL plumber to a guardian of data integrity, and why this matters more than ever for AI.

It wasn’t a server crash that woke me up. The pager didn’t scream with a 500 error or a database connection failure. The call came from a director in finance, and his voice was unnervingly calm. He said the quarterly forecast model had produced a number that was not just wrong, but nonsensical. An impossible projection that made a mockery of the entire business.

And all our dashboards were green. The system, by every metric we tracked, was perfectly healthy.

The Plumbing Was Perfect, The Premise Was Wrong

At the time, my team was proud of our data platform. We had built a resilient, high-throughput system for moving transactional data into a central warehouse. We saw our job as being excellent plumbers, focused on uptime and latency. We had monitoring for the infrastructure and alerts for job failures. We treated the data itself as cargo. As long as the container arrived on schedule, we considered the job done.

This was our fundamental mistake. We engineered for system failure, but not for semantic failure.

The problem, discovered after hours of tracing lineage, was a broken assumption. A single transaction_amount field had always been in US dollars. But three days earlier, a bug in an upstream service began sending values in Japanese Yen without a corresponding currency code. Our pipeline, seeing only a number, dutifully ingested ¥10,000 as $10,000. It wasn't an error; it was just a number. A catastrophically wrong number.

[Figure: The Flawed Trusting Pipeline. Source Systems → ETL (Assumes Correctness) → Data Warehouse → Nonsensical Forecast]

Engineering for Trust

The root cause wasn't code; it was a broken, implicit contract with an upstream team. That incident forced a re-evaluation of my role. An architect's job isn't just to design the pipes, but to ensure the integrity of what flows through them. My responsibility didn't end when the ETL job succeeded. It ended when a decision-maker could trust the number on their screen.

The fix was to change our philosophy. We decided that data would be considered guilty until proven innocent. We built a quality firewall, a series of automated checks that every batch had to pass before it was allowed to merge with our trusted datasets. If any test failed, the pipeline stopped. Silence is better than a lie.
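A minimal sketch of such a gate, assuming batches arrive as lists of dictionaries. The names `run_gate`, `require_fields`, `require_one_of`, `BatchRejected`, and the `currency` field are hypothetical, chosen for illustration; this is not the code we ran.

```python
from typing import Callable, Optional

# A check inspects a whole batch and returns an error message, or None if it passes.
Check = Callable[[list], Optional[str]]


class BatchRejected(Exception):
    """Raised when a batch fails the gate and must not be merged."""


def require_fields(*names: str) -> Check:
    """Every row must carry a non-null value for each named field."""
    def check(rows: list) -> Optional[str]:
        for i, row in enumerate(rows):
            missing = [n for n in names if row.get(n) is None]
            if missing:
                return f"row {i} is missing {missing}"
        return None
    return check


def require_one_of(field: str, allowed: set) -> Check:
    """The field's value must come from an explicit allow-list."""
    def check(rows: list) -> Optional[str]:
        for i, row in enumerate(rows):
            if row.get(field) not in allowed:
                return f"row {i} has unexpected {field}={row.get(field)!r}"
        return None
    return check


def run_gate(rows: list, checks: list) -> list:
    """Return the batch unchanged if every check passes; otherwise raise."""
    failures = [msg for msg in (check(rows) for check in checks) if msg is not None]
    if failures:
        raise BatchRejected("; ".join(failures))
    return rows


# Hypothetical contract for the transactions feed: amounts are in USD and say so.
CHECKS = [
    require_fields("transaction_amount", "currency"),
    require_one_of("currency", {"USD"}),
]
```

The design choice is that `run_gate` raises instead of returning a status flag, so the merge step cannot run by accident on a rejected batch. The caller has to catch `BatchRejected` explicitly to do anything else with the data.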

This wasn't about a new technology. It was about discipline. We started encoding our assumptions as code: schema adherence, nullity constraints, value ranges, and freshness checks. The most critical check, which would have caught our currency bug, was monitoring data distributions. If the mean of a key metric suddenly jumps by orders of magnitude, as ours did when yen were read as dollars, the system now halts automatically. This philosophy is the foundation of modern tools like dbt tests and open-source libraries like Great Expectations, which treat data assertions as a first-class part of any pipeline.
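The distribution check can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not our production check: the function names, the one-order-of-magnitude threshold, and the sample values are all hypothetical, and a real check would likely use a trailing window of history and a more robust statistic than the mean.

```python
import math
from statistics import fmean


class DistributionShift(Exception):
    """Raised when a batch's mean is too far from the historical baseline."""


def magnitude_shift(baseline: list, batch: list) -> float:
    """Return log10(batch mean / baseline mean); 2.0 means a 100x jump."""
    base_mean = fmean(baseline)
    batch_mean = fmean(batch)
    if base_mean <= 0 or batch_mean <= 0:
        # Assumption of this sketch: the metric is strictly positive on average.
        raise ValueError("means must be positive to compare orders of magnitude")
    return math.log10(batch_mean / base_mean)


def check_distribution(baseline: list, batch: list, max_shift: float = 1.0) -> float:
    """Halt if the batch mean moved more than max_shift orders of magnitude."""
    shift = magnitude_shift(baseline, batch)
    if abs(shift) > max_shift:
        raise DistributionShift(
            f"mean shifted by {shift:+.2f} orders of magnitude (limit {max_shift})"
        )
    return shift


# Illustrative values only: typical USD amounts, then yen amounts with no currency code.
usd_history = [80.0, 120.0, 100.0]
yen_batch = [12000.0, 18000.0, 15000.0]
```

Calling `check_distribution(usd_history, yen_batch)` raises `DistributionShift`, because the batch mean is 150 times the baseline. Taking the absolute value of the shift means a sudden collapse in the metric halts the pipeline too, not just a spike.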

The Cost of Skepticism

This approach is not free. Being honest about the trade-offs is crucial. Halting a pipeline creates its own class of problems. A VP waiting for a dashboard doesn't care that your data is pure; they care that their report is missing. It requires organizational buy-in to establish that accuracy is more important than freshness.

It also adds latency. These checks take time to run. For real-time streaming data, the "stop the world" batch approach doesn't work. The strategy has to adapt to flagging and quarantining individual records, which is a much harder engineering problem. Implementing a quality firewall is as much a political challenge as a technical one. You have to convince the organization to value trust over speed.
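For the streaming case, the per-record version of the idea can be sketched as a generator that passes clean records through and sets the rest aside. The field names, the `ALLOWED_CURRENCIES` set, and the in-memory quarantine list are assumptions for illustration; a real system would write quarantined records to a durable store and alert on their rate.

```python
from typing import Iterable, Iterator

ALLOWED_CURRENCIES = {"USD"}  # hypothetical contract for this feed


def validate(record: dict) -> list:
    """Return a list of problems with one record; empty means it is clean."""
    problems = []
    amount = record.get("transaction_amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("transaction_amount is not numeric")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        problems.append("currency is missing or unexpected")
    return problems


def route(records: Iterable[dict], quarantine: list) -> Iterator[dict]:
    """Yield clean records; append flagged ones to quarantine with the reasons."""
    for record in records:
        problems = validate(record)
        if problems:
            quarantine.append({"record": record, "problems": problems})
        else:
            yield record
```

The stream keeps flowing while bad records accumulate in quarantine with the reason attached, so someone can inspect and replay them later. What this sketch does not solve is the hard part mentioned above: downstream aggregates are now computed over incomplete data, and consumers need a way to know that.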

Why This Matters More with AI

That incident happened years ago, but the lesson is more urgent today than ever. We are now building systems where LLM agents consume our data. A bad number in a BI report is embarrassing. A bad number fed to an agent can be actively destructive.

Imagine an LLM agent tasked with dynamic inventory re-ordering. Fed a transaction amount off by three orders of magnitude, it wouldn't just produce a bad report; it might attempt to purchase a thousand shipping containers of a product instead of one. The cost of a data quality failure goes from financial miscalculation to operational catastrophe. The feedback loop is faster and the blast radius is larger.

This isn't just my experience. In their work on Data Validation for Machine Learning, researchers at Google detailed how the needs of ML systems force a much stricter standard of data quality than traditional analytics. The boring, deterministic work of a data firewall is the bedrock that makes exciting, agentic work possible and, more importantly, safe.

[Figure: Modern Architecture for Data Integrity. Four stages: Data Sources (Applications, Event Streams, Third-Party APIs, File Drops) → Ingest and Validation Firewall (Quarantine Zone, Automated DQ Checks, Agentic Anomaly Detection, Reject and Alert Logic) → Trusted Data Core (Data Warehouse, Feature Store, Vector Database) → Serving and Consumption (Deterministic BI, Agentic Workflows, Model Training)]

My Job Is to Guarantee the Data

The shift from data plumber to data guardian is a non-negotiable part of modern architecture. My job became harder the day I started caring about the contents of the data, not just the container, but the systems I build became far more durable. The core principles remain the same.

  • Trust is not inherited; it is earned. Never assume upstream data is clean. Verify it.
  • Automate your skepticism. Encode your assumptions about data as explicit, version-controlled tests.
  • Silence is better than a lie. It is always preferable to halt a pipeline than to deliver corrupted data.
  • Observability is for semantics, not just systems. Monitor the shape and distribution of your data as rigorously as you monitor CPU and memory.
Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.