What unifying millions of records taught me about trust
Unifying millions of records reveals a hard truth: data problems are trust problems. Learn how data contracts create the reliable foundation for AI agents.
The meeting began with a question that should have been simple: "How many active customers do we have?" The head of Sales pointed to a dashboard showing one number. The head of Engineering pulled up a chart showing another, nearly thirty percent lower. The support lead, looking at her ticketing system, had a third. Nobody was wrong, but nothing was right. That's the moment a data project stops being about records and starts being about trust.
For years, I've seen teams try to solve this with technology—a bigger data warehouse, a faster transformation tool, a novel "single source of truth" platform. And I've seen them fail. They believed they had a data problem. They actually had a trust problem, and were trying to fix a faulty wire when the foundation itself was cracked.

Every Field is a Political Statement
When you dig into why three systems have three different counts for "customer," you uncover the hidden organizational chart. To Sales, a "customer" is an account in the CRM, a target for revenue. To Engineering, it's a user ID that has authenticated recently, a measure of platform engagement. To Support, it's an entity with a service-level agreement, a unit of operational cost. Each definition is a local truth, optimized for a specific function.
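To make the disagreement concrete, here is a minimal sketch of how the same set of organizations can yield three honest but different counts. The `Entity` fields, the 30-day window for "authenticated recently," and the sample names are all assumptions made up for illustration, not anything taken from a real system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """One organization as seen by three hypothetical systems."""
    name: str
    crm_account_open: bool      # Sales: an account exists in the CRM
    last_login: Optional[date]  # Engineering: most recent authentication
    has_sla: bool               # Support: covered by a service-level agreement


TODAY = date(2024, 6, 1)     # fixed so the example is reproducible
RECENT = timedelta(days=30)  # assumed meaning of "authenticated recently"


def is_sales_customer(e: Entity) -> bool:
    return e.crm_account_open


def is_engineering_customer(e: Entity) -> bool:
    return e.last_login is not None and TODAY - e.last_login <= RECENT


def is_support_customer(e: Entity) -> bool:
    return e.has_sla


ENTITIES = [
    Entity("Acme", True, date(2024, 5, 20), True),
    Entity("Globex", True, date(2024, 1, 5), True),
    Entity("Initech", True, None, False),
    Entity("Umbrella", True, date(2024, 5, 30), False),
    Entity("Hooli", False, date(2024, 5, 28), False),
]


def count_customers(entities: list) -> dict:
    """Apply each department's definition to the same entities."""
    return {
        "sales": sum(is_sales_customer(e) for e in entities),
        "engineering": sum(is_engineering_customer(e) for e in entities),
        "support": sum(is_support_customer(e) for e in entities),
    }


print(count_customers(ENTITIES))
# {'sales': 4, 'engineering': 3, 'support': 2}
```

None of the three predicates is buggy. Each one answers a different question, which is why picking a single "correct" one settles nothing.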
The naïve architect—and I've been him—tries to force a universal definition. We build pipelines to cram all source data into a pristine, canonical model. This is the path to ruin. It ignores that these definitions aren't mistakes; they are the business logic of the organization made manifest in data. A top-down mandate to unify them without consensus feels like a hostile takeover of a department's core metrics. The technical act of transformation becomes a political act of invalidation.

The Myth of the Golden Record
The ultimate expression of this technical-first thinking is the "golden record." The idea is seductive: one perfect, complete, de-duplicated record for every core entity. I've spent cycles trying to build them. What I learned is that the golden record, pursued single-mindedly, is often a fool's errand.
Now, in highly regulated domains like healthcare or finance, a single authoritative view of a customer or patient is often a legal necessity. There, the lost agility is a price worth paying for compliance. But for most organizations, by the time you've reconciled every difference and built the complex pipelines needed to maintain this perfect state, the "source of truth" has become a brittle bottleneck. Teams inevitably create their own local copies to get work done. You end up right back where you started, only with a more expensive and complicated mess.
The real goal isn't a single, monolithic record. It's a shared understanding and a reliable way to navigate the different contexts. It's less about a golden record and more about a Rosetta Stone—a key that lets you translate between departmental contexts with confidence.

Architecture as a Trust-Building Exercise
If a central model is a trap, what's the alternative? Shifting the architect's role from a builder of systems to a facilitator of trust. The work becomes less about writing SQL and more about brokering agreements. This philosophy shares its soul with Data Mesh, the architectural paradigm Zhamak Dehghani detailed in her work on moving beyond monolithic data lakes. The core idea is distributed ownership: the teams that produce data are best equipped to own and serve it as a product.
We started implementing this with **data contracts**. Each team—Sales, Engineering, Support—became responsible for publishing their version of "customer" as a clean, documented, and stable data product. They guaranteed the schema, the semantics, and the refresh rate. They owned it. They were accountable for it. As detailed in resources like The Data Contract Manifesto, this isn't just documentation; it's an enforceable, version-controlled agreement that makes data a first-class citizen in your engineering culture.
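As a rough illustration of what "guaranteeing the schema, the semantics, and the refresh rate" could look like in code, here is a minimal sketch using only the standard library. The `DataContract` and `FieldSpec` classes, the field names, the allowed `stage` values, and the 24-hour freshness window are all hypothetical. Real contracts are usually written in a declarative format and enforced in CI or in the publishing pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type_: type
    meaning: str                      # semantics, stated in plain language
    allowed: frozenset = frozenset()  # empty means any value of type_


@dataclass(frozen=True)
class DataContract:
    product: str
    owner: str
    version: str
    max_staleness: timedelta          # the promised refresh rate
    fields: tuple

    def violations(self, rows, published_at, now):
        """Return a list of human-readable contract violations."""
        problems = []
        if now - published_at > self.max_staleness:
            problems.append(f"stale: published {published_at.isoformat()}")
        for i, row in enumerate(rows):
            for spec in self.fields:
                if spec.name not in row:
                    problems.append(f"row {i}: missing {spec.name}")
                    continue
                value = row[spec.name]
                if not isinstance(value, spec.type_):
                    problems.append(f"row {i}: {spec.name} has wrong type")
                elif spec.allowed and value not in spec.allowed:
                    problems.append(f"row {i}: {spec.name}={value!r} not allowed")
        return problems


# Hypothetical contract published by the Sales team.
SALES_CUSTOMER = DataContract(
    product="sales_customer",
    owner="sales-ops",
    version="1.0.0",
    max_staleness=timedelta(hours=24),
    fields=(
        FieldSpec("account_id", str, "CRM account identifier"),
        FieldSpec("stage", str, "Sales pipeline stage",
                  frozenset({"prospect", "closed_won", "churned"})),
    ),
)

now = datetime(2024, 6, 1, 12, 0)
rows = [{"account_id": "A-1", "stage": "closed_won"}]
print(SALES_CUSTOMER.violations(rows, now - timedelta(hours=1), now))  # []
```

Because the contract is ordinary versioned code, a change to a field's allowed values shows up as a reviewable diff rather than a surprise in someone's dashboard.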
The central data team's job changed. We were no longer janitors. We became librarians, curating a catalog of these well-defined data products. We built platforms that made it easy to discover and join `sales_customer` with `engineering_user`. The key was that we never obscured the origin. The context was always preserved.
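One way to picture "join without obscuring the origin" is a small in-memory catalog whose join prefixes every column with the product it came from. This is only a sketch under invented assumptions: the `DataProduct` class, the `CATALOG` dictionary, the key names, and the crosswalk of matching IDs are hypothetical stand-ins for whatever a real platform provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    name: str
    owner: str
    key: str     # name of the identifying column
    rows: tuple  # tuple of dicts, one per record


CATALOG = {}


def register(product: DataProduct) -> None:
    """Add a product to the catalog so others can discover it by name."""
    CATALOG[product.name] = product


def join_products(left_name: str, right_name: str, crosswalk: list) -> list:
    """Inner-join two cataloged products via (left_key, right_key) pairs.

    Every output column is prefixed with its source product, so two
    columns with the same name never overwrite each other.
    """
    left, right = CATALOG[left_name], CATALOG[right_name]
    left_by_key = {r[left.key]: r for r in left.rows}
    right_by_key = {r[right.key]: r for r in right.rows}
    joined = []
    for lk, rk in crosswalk:
        if lk in left_by_key and rk in right_by_key:
            row = {f"{left.name}.{c}": v for c, v in left_by_key[lk].items()}
            row.update({f"{right.name}.{c}": v for c, v in right_by_key[rk].items()})
            joined.append(row)
    return joined


register(DataProduct("sales_customer", "sales-ops", "account_id", (
    {"account_id": "A-1", "status": "active"},
    {"account_id": "A-2", "status": "churned"},
)))
register(DataProduct("engineering_user", "platform-eng", "user_id", (
    {"user_id": "u-17", "status": "suspended"},
)))

# Hypothetical mapping between Sales account IDs and Engineering user IDs.
CROSSWALK = [("A-1", "u-17"), ("A-2", "u-99")]

for row in join_products("sales_customer", "engineering_user", CROSSWALK):
    print(row)
```

In the joined row, `sales_customer.status` and `engineering_user.status` sit side by side with different values. The disagreement stays visible instead of being silently resolved in favor of one team.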

The Foundation for Agentic Work
This architecture of trust isn't just about better dashboards. It's the absolute prerequisite for building reliable AI systems. An LLM agent tasked with summarizing customer health or identifying at-risk accounts is only as good as the data it consumes. Without trusted inputs, it becomes a high-speed, highly articulate generator of nonsense.
This is where the two sides of my world—deterministic automation and agentic systems—meet. The data contract is the trust anchor. It provides a guaranteed, deterministic foundation. When an LLM agent queries the `support_account` data product, the contract ensures the data's structure and meaning are stable. The agent doesn't have to guess if `status:active` means the same thing this week as it did last week. That guarantee is the job of the deterministic pipeline that publishes the data product.
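The split between the deterministic side and the agentic side can be sketched as a gate that runs before any row reaches the agent. Everything here is an assumption for illustration: the `Published` wrapper, the pinned major version, the set of `status` values, and the `build_agent_context` function, which stands in for whatever prompt-building step a real agent would use. No LLM is called.

```python
from dataclasses import dataclass


class ContractViolation(Exception):
    """Raised when published data does not match what the agent expects."""


@dataclass(frozen=True)
class Published:
    product: str
    contract_version: str  # semantic version, e.g. "1.2.0"
    rows: tuple


# Hypothetical expectations the agent was built and evaluated against.
EXPECTED_PRODUCT = "support_account"
EXPECTED_MAJOR = 1
STATUS_VALUES = frozenset({"active", "suspended", "closed"})


def gate(published: Published) -> tuple:
    """Deterministic check: fail loudly rather than let the agent guess."""
    if published.product != EXPECTED_PRODUCT:
        raise ContractViolation(f"unexpected product {published.product!r}")
    major = int(published.contract_version.split(".")[0])
    if major != EXPECTED_MAJOR:
        raise ContractViolation(
            f"contract major version {major}, expected {EXPECTED_MAJOR}"
        )
    for row in published.rows:
        if row.get("status") not in STATUS_VALUES:
            raise ContractViolation(f"unknown status in row {row!r}")
    return published.rows


def build_agent_context(published: Published) -> str:
    """Produce the factual context an agent would receive in its prompt."""
    rows = gate(published)
    active = sum(r["status"] == "active" for r in rows)
    return (
        f"{active} of {len(rows)} support accounts are active "
        f"(contract v{published.contract_version})."
    )


ok = Published("support_account", "1.2.0", (
    {"account_id": "S-1", "status": "active"},
    {"account_id": "S-2", "status": "closed"},
))
print(build_agent_context(ok))
# 1 of 2 support accounts are active (contract v1.2.0).
```

A breaking change, such as a new major version that redefines `status`, stops at the gate with a `ContractViolation` instead of flowing into a fluent but wrong summary.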
The agentic work—the probabilistic, fuzzy task of interpreting behavior and summarizing sentiment—can only be trusted when it operates on a bedrock of deterministic, contract-bound data. Feeding it a swamp of conflicting, undefined metrics is how you get confident, plausible, and dangerously wrong answers. Building data contracts isn't the boring preliminary to the "real" AI work; it *is* the work.
When you're asked to unify data, you're really being asked to unify parts of the organization. Resisting the urge to solve it purely with a bigger database is the first and most important step.
- Start with people, not schemas. Your first step isn't to open a database client. It's to map the organization and find the people who lose sleep when their numbers are wrong. They are your allies.
- Federate ownership, don't centralize blame. Make teams responsible for the quality of the data they publish. A central team that takes all the responsibility becomes a scapegoat for every bad report.
- Build a Rosetta Stone, not a golden record. Your goal is confident translation between contexts, not the erasure of context. Preserve the source, and make the lineage clear and observable.
- Provide the foundation for AI. The most durable data architecture is the one that enables trustworthy automation. Treat data contracts as the non-negotiable bedrock for any serious work with LLM agents.