Identity resolution: the problem that humbled me

Data

Identity resolution is one of the hardest problems in data architecture. A practitioner's guide to moving from deterministic rules to probabilistic systems.

The first time I personally caused two different people to be merged into a single customer record, I didn’t know it for weeks. The support ticket came in hot: a user was seeing someone else's order history. The bug wasn't in the UI or a caching layer. The error was deep in the data foundation, in a system I had designed to create a "unified customer view." The cause was a faulty assumption about address data. And it taught me a lesson that no whiteboard architecture session ever could.

Identity resolution is the problem that humbled me. It looks like a software engineering task, but it’s really a Trojan horse for data science. It’s the moment an architect’s love for clean, deterministic logic collides with the chaotic, probabilistic real world.

[Figure: From Chaos to a Probabilistic Link. Stages shown: Multiple Raw Sources, Deterministic First Pass, Ambiguous Remainder, Probabilistic Scoring, Linked Entity Graph.]

The Seductive Simplicity of the Golden Record

Every Master Data Management and Customer Data Platform vendor sells the dream of the "golden record." A single, authoritative profile for every entity. It’s a powerful vision, and a logical one. But this vision often glosses over the brutal complexity of creating it.

The initial plan is always deterministic. We’ll use an email address as the primary key. Then we discover one person with five emails, and five people sharing one. So we add a phone number. Then a physical address. With each new identifier, we believe we're closing in on certainty. But we're actually just building a web of fuzzy, decaying, and contradictory signals. The dream of a simple SQL JOIN on a clean key evaporates.

From Deterministic Rules to Probabilistic Reality

My first instinct was to build a deterministic rules engine. If email matches, merge. If firstName, lastName, and zipCode match, merge. This works for the simplest cases, the clean matches. But it leaves the vast majority of ambiguous records untouched. How do you write a rule for "Jon Smith" vs. "Jonathan Smyth"? Or for a typo in a street name?
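
To make the limits of that approach concrete, here is a minimal sketch of the two rules described above. The `Record` fields, the `_norm` helper, and the rule names are hypothetical, chosen only for illustration, not taken from any real system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical record shape, for illustration only.
    record_id: str
    email: Optional[str]
    first_name: str
    last_name: str
    zip_code: Optional[str]


def _norm(value: Optional[str]) -> str:
    """Lowercase and trim a value; missing values become an empty string."""
    return value.strip().lower() if value else ""


def deterministic_match(a: Record, b: Record) -> Optional[str]:
    """Return the name of the rule that fired, or None if no rule applies."""
    # Rule 1: identical, non-empty email.
    if _norm(a.email) and _norm(a.email) == _norm(b.email):
        return "email"
    # Rule 2: identical first name, last name, and zip code, all non-empty.
    key_a = (_norm(a.first_name), _norm(a.last_name), _norm(a.zip_code))
    key_b = (_norm(b.first_name), _norm(b.last_name), _norm(b.zip_code))
    if all(key_a) and key_a == key_b:
        return "name+zip"
    return None


jon = Record("r1", None, "Jon", "Smith", "02139")
jonathan = Record("r2", None, "Jonathan", "Smyth", "02139")
print(deterministic_match(jon, jonathan))  # no rule fires for this pair
```

Exact-match rules return a yes or a silence. They have no way to say "probably," which is exactly what the ambiguous pairs need.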

You can't. You have to shift from asking "are these the same?" to "how *likely* is it that these refer to the same entity?" This isn't a new problem. This is the world of probabilistic record linkage, a field formally defined in a foundational 1969 paper by Ivan P. Fellegi and Alan B. Sunter, "A Theory of Record Linkage." It reframes the challenge from writing absolute logic to managing uncertainty. It's less about software engineering and more about statistics.
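
As a rough sketch of the Fellegi-Sunter idea: each field gets an `m` probability (the field agrees given the records truly match) and a `u` probability (the field agrees given they do not), and a pair's score is the sum of per-field log-likelihood ratios. The `m` and `u` values below are made up for illustration; in practice they are estimated from data.

```python
import math
from typing import Mapping, Tuple

# Hypothetical (m, u) values per field, for illustration only.
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PARAMS: Mapping[str, Tuple[float, float]] = {
    "last_name": (0.95, 0.01),
    "zip_code": (0.90, 0.05),
    "email": (0.85, 0.0001),
}


def match_weight(agreements: Mapping[str, bool]) -> float:
    """Sum the per-field log2 likelihood ratios for one record pair."""
    total = 0.0
    for field_name, agrees in agreements.items():
        m, u = FIELD_PARAMS[field_name]
        if agrees:
            total += math.log2(m / u)  # agreement is evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement is evidence against
    return total


strong = match_weight({"last_name": True, "zip_code": True, "email": True})
mixed = match_weight({"last_name": True, "zip_code": True, "email": False})
print(strong > mixed)  # agreement on a rare value like email adds a lot of weight
```

The useful property is that fields contribute unequally. Agreement on something that rarely coincides by chance (a low `u`) moves the score far more than agreement on something common.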

The Unglamorous Architecture That Works

A real identity resolution system doesn’t seek absolute truth. It seeks a configurable, auditable level of confidence. The architecture I’ve seen hold up in production is a pipeline built on probabilities, and its success hinges on the boring, unglamorous steps.

First is normalization and standardization. Then comes blocking: grouping potentially similar records so you don't have to compare every record to every other. I once brought a system to its knees by choosing postal code as a blocking key, forgetting a new data source had international records where that field was often null or formatted differently. The block size exploded, and the pairwise comparison stage choked. The boring details are what make or break these systems.
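
A minimal sketch of blocking on a postal code, with the guard rails that incident taught me to want. The record shape (`id`, `postal_code`), the function names, and the `max_block_size` default are assumptions for illustration. The normalization here is deliberately crude and is not a real international postal code parser.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def normalize_postal(raw: Optional[str]) -> Optional[str]:
    """Uppercase and strip non-alphanumerics; empty or missing becomes None."""
    if raw is None:
        return None
    cleaned = "".join(ch for ch in raw.upper() if ch.isalnum())
    return cleaned or None


def block_records(records: Iterable[dict], max_block_size: int = 1000):
    """Group record ids by normalized postal code.

    Records with no usable key are set aside instead of being lumped
    into one giant block, and oversized blocks are reported.
    """
    blocks: Dict[str, List] = defaultdict(list)
    unblocked: List = []
    for rec in records:
        key = normalize_postal(rec.get("postal_code"))
        if key is None:
            unblocked.append(rec["id"])
        else:
            blocks[key].append(rec["id"])
    oversized = {k: len(v) for k, v in blocks.items() if len(v) > max_block_size}
    return dict(blocks), unblocked, oversized


def candidate_pairs(blocks: Dict[str, List]) -> Iterator[Tuple]:
    """Yield pairs only within a block; a block of n ids yields n*(n-1)/2 pairs."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)
```

The quadratic pair count is the reason block size matters: a block of 1,000 records produces 499,500 comparisons, and one of 100,000 produces roughly five billion. Records that land in `unblocked` still need a home, typically a second blocking pass on a different key.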

Only after that can you do pairwise comparison, using algorithms like Jaro-Winkler for names, to generate a confidence score. Two thresholds then split the scores into three outcomes: auto-merge above the upper one, auto-reject below the lower one, and human review for everything in between. Those thresholds are the business logic, the dials you turn to balance risk.
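
Here is a small pure-Python sketch of Jaro-Winkler similarity and a threshold-based decision. The threshold values and the `decide` function are hypothetical, and this version applies the prefix bonus unconditionally, whereas some implementations only apply it above a minimum Jaro score. In production you would more likely reach for a tested library than hand-roll this.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [s1[i] for i in range(len1) if matched1[i]]
    seq2 = [s2[j] for j in range(len2) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def decide(score: float, merge_at: float = 0.95, reject_below: float = 0.80) -> str:
    """Map a score to an action. The default thresholds are illustrative only."""
    if score >= merge_at:
        return "auto-merge"
    if score < reject_below:
        return "auto-reject"
    return "review"


score = jaro_winkler("jon smith", "jonathan smyth")
print(round(score, 3), decide(score))
```

The similarity function is the easy part. Choosing `merge_at` and `reject_below` is where the business risk lives, and those values should be tuned against labeled pairs, not guessed.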

The Two Failures That Matter

The entire game is a trade-off between two types of failure. A **false negative** is failing to merge two records that belong to the same person. The cost is a fragmented view and a missed opportunity. It’s not great, but it's rarely catastrophic.

A **false positive** is far more dangerous. This is when you incorrectly merge two different people, like my incident with the order history. You link Jane Doe's data to Jane Smith's account. This can be a privacy breach, a compliance violation, and an operational disaster. This is why the architecture must be designed for correction. Your "golden record" is never final; it is a living hypothesis that the system must be able to revise. It is not a record, but an identity graph with auditable, reversible links.
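
A minimal in-memory sketch of what "auditable, reversible links" could look like. The `Link` and `IdentityGraph` names and their fields are hypothetical. A real system would persist this in a database and record who made each change and when.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Link:
    # A hypothesis that two records refer to the same entity.
    a: str
    b: str
    score: float
    reason: str
    active: bool = True
    revoked_reason: Optional[str] = None


class IdentityGraph:
    def __init__(self) -> None:
        # Append-only: links are deactivated, never deleted.
        self.links: List[Link] = []

    def link(self, a: str, b: str, score: float, reason: str) -> Link:
        new_link = Link(a, b, score, reason)
        self.links.append(new_link)
        return new_link

    def unlink(self, a: str, b: str, why: str) -> int:
        """The un-merge: deactivate active links between a and b, keep the history."""
        revoked = 0
        for existing in self.links:
            if existing.active and {existing.a, existing.b} == {a, b}:
                existing.active = False
                existing.revoked_reason = why
                revoked += 1
        return revoked

    def entity_of(self, record_id: str) -> Set[str]:
        """All records reachable from record_id through active links."""
        seen = {record_id}
        frontier = [record_id]
        while frontier:
            current = frontier.pop()
            for existing in self.links:
                if not existing.active:
                    continue
                if existing.a == current:
                    neighbor = existing.b
                elif existing.b == current:
                    neighbor = existing.a
                else:
                    continue
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append(neighbor)
        return seen


graph = IdentityGraph()
graph.link("crm:1", "web:7", 0.99, "email")
graph.link("web:7", "list:42", 0.91, "name+address")
graph.unlink("web:7", "list:42", "support ticket: different person")
print(sorted(graph.entity_of("crm:1")))
```

Because the "golden record" is derived from the active links at read time, revoking one bad link splits the entity again without losing the evidence of why it was ever joined.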

[Figure: Production Identity Resolution Architecture. Sources: Customer Apps, Event Streams, Third-party Lists. Ingestion & Staging: Raw Data Lake, Normalization Jobs, Blocking Index. Core Resolution Engine: Deterministic Rules, Probabilistic Models, Human Review Queue, Identity Graph DB. Serving: Golden Record API, Analytics Views, CRM Sync.]

Three Principles for Surviving Identity Resolution

That journey from a simple whiteboard box to a functioning (and perpetually imperfect) identity system left me with a few hard-won principles. They are less about specific tech and more about mindset.

  1. Define "identity" for a specific use case. The certainty required for a marketing analytics database is profoundly different from what’s needed for a financial or healthcare system. Don't chase a platonic ideal. Define what a "match" means for the business decision you need to make, and tune your system to that specific risk tolerance.
  2. Build for correction, not perfection. The system's output isn't a single table of truth. It's a graph of entities and the probabilistic links between them. Thinking in graphs, as Martin Kleppmann describes in his essential book Designing Data-Intensive Applications, makes it easier to model uncertainty and to trace, audit, and correct the connections over time. Your most important feature is the "un-merge" button.
  3. Accept that it is never "done." An identity resolution system is not a project you complete; it is a core capability you maintain. New data sources appear, patterns of error evolve, and business needs change. It requires stewardship. That simple box on the whiteboard represents a permanent, living, and humbling part of your architecture.
Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.