Identity resolution across millions of records — what it really costs
A practitioner's guide to the real costs of identity resolution. We explore the economics of brute-force matching, deterministic pipelines vs. vector embeddings, and why human review is inevitable.
It usually starts with a simple request: "We need a single view of the customer." The business sees a dozen fragmented systems and imagines a clean table where two different email addresses resolve to a single person. They think it's a deduplication task. I've learned it's one of the most deceptively expensive infrastructure projects an enterprise can undertake.
The N-Squared Cost Problem
The core challenge of identity resolution is combinatorial. To find every possible match in a dataset, you must compare every record with every other one, which is n(n-1)/2 pairs for n records. For 100 million records, that's roughly 5 quadrillion pairs. Even if a single comparison took just one microsecond, you would need over 150 years of sequential compute time. The cost curve goes vertical before you even start.
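The arithmetic above is easy to check. This is a back-of-the-envelope sketch, assuming one microsecond per comparison (the figure used in the text), a single sequential worker, and a 365.25-day year; the function names are mine.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n choose 2."""
    return n * (n - 1) // 2


def brute_force_years(n: int, seconds_per_comparison: float = 1e-6) -> float:
    """Sequential compute time, in years, to compare every pair once."""
    return pair_count(n) * seconds_per_comparison / SECONDS_PER_YEAR


for n in (1_000_000, 10_000_000, 100_000_000):
    print(f"{n:>11,} records: {pair_count(n):.2e} pairs, "
          f"{brute_force_years(n):,.2f} years")
```

At 100 million records this comes out at about 158 years. Because the pair count grows quadratically, a dataset ten times smaller is a hundred times cheaper, which is why the problem looks harmless in a pilot and impossible in production.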
You cannot solve this with raw compute. The entire game is about intelligently avoiding comparisons. This isn't a new problem. The statistical foundation was laid out in a 1969 paper, "A Theory for Record Linkage" by Fellegi and Sunter, which still underpins most modern systems. The primary pattern for avoiding comparisons is "blocking": only comparing records that are already similar in some obvious way, like sharing a zip code and the first three letters of a last name.
This is the central trade-off. Your blocking strategy is a direct tug-of-war between cost and accuracy. A very specific blocking key produces few candidate pairs, so it is cheap, but it will miss matches where a keyed field such as a name is misspelled. A looser key creates more candidate pairs, increasing your recall but also your compute cost. There is no magic answer, only a budget.
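A minimal sketch of blocking, using the zip-plus-three-letters key from the text and a looser zip-plus-initial key for contrast. The record layout and function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def strict_key(record):
    """Zip code plus the first three letters of the last name."""
    return (record["zip"], record["last_name"][:3].lower())


def loose_key(record):
    """Zip code plus only the first letter of the last name."""
    return (record["zip"], record["last_name"][:1].lower())


def candidate_pairs(records, key_fn):
    """Group records by blocking key, then pair records only within a group."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


records = [
    {"id": 1, "last_name": "Johnson", "zip": "10001"},
    {"id": 2, "last_name": "Johnsen", "zip": "10001"},
    {"id": 3, "last_name": "Jonson", "zip": "10001"},  # misspelling changes the strict key
    {"id": 4, "last_name": "Smith", "zip": "94105"},
]

strict_pairs = candidate_pairs(records, strict_key)
loose_pairs = candidate_pairs(records, loose_key)
print(len(strict_pairs), "pairs with the strict key:", strict_pairs)
print(len(loose_pairs), "pairs with the loose key:", loose_pairs)
```

The strict key yields one candidate pair and never looks at the misspelled "Jonson" record. The loose key yields three pairs and catches it, at three times the comparison cost on this toy input. Real systems often combine several keys and take the union of the candidate pairs.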
Deterministic Pipelines and Their Fragility
Once you solve the brute-force problem with blocking, you hit the second wall: the data itself. No two source systems format an address the same way. One uses St., another Street. Phone numbers arrive with or without country codes, parentheses, or dashes. This isn't just messy; it's actively hostile to deterministic matching.
In my experience, the engineering effort to build and maintain a resilient cleansing and standardization pipeline often dwarfs the effort spent tuning the matching logic. Every time a new data source is added, the pipeline is at risk. A small change in an upstream export format can introduce subtle normalization bugs that poison the entire identity graph. This component requires constant vigilance and represents a significant, ongoing operational cost.
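To make the fragility concrete, here is a deliberately naive sketch of phone and address standardization. The abbreviation table, the default country code of 1, and the function names are all assumptions made for illustration. A real pipeline needs far larger lookup tables and locale-aware parsing.

```python
import re

# Tiny, hypothetical abbreviation table; production tables are much larger.
STREET_ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
    "blvd": "boulevard",
}


def normalize_phone(raw, default_country_code="1"):
    """Keep digits only; assume 10-digit numbers lack a country code."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = default_country_code + digits
    return "+" + digits


def normalize_address(raw):
    """Lowercase, drop punctuation, expand known street abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(STREET_ABBREVIATIONS.get(tok, tok) for tok in tokens)


print(normalize_phone("(212) 555-0100"), normalize_phone("+1 212-555-0100"))
print(normalize_address("123 Main St."), "|", normalize_address("123 MAIN STREET"))

# The same rule that unifies "St." and "Street" also mangles "St. Paul":
print(normalize_address("45 St. Paul Ave"))
```

The last line is the point. A rule that is right for most inputs silently rewrites "St. Paul" as "street paul", and nothing fails loudly. This is the kind of subtle normalization bug that ends up in the identity graph.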
The Modern Alternative: Vector Search
The classic approach is rule-based: explicit comparisons on explicit fields. The modern, AI-native counterpoint is to use embeddings. Instead of writing rules to compare last names or zip codes, you can use a language model to convert each entity record into a dense vector. You then find duplicates by searching for the nearest neighbors in that high-dimensional space. This approach, explored in papers like "Deep Entity Matching with Pre-Trained Language Models", can be incredibly powerful for finding semantic, not just syntactic, similarities.
But it introduces its own set of trade-offs. Generating embeddings for millions of records is computationally expensive. The matching process can feel like a black box; explaining to a compliance officer *why* the model decided two doctors with the same name were a match can be much harder than pointing to a clear rule about a shared medical ID number. It's a powerful tool, but not a free lunch.
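The nearest-neighbor mechanics can be sketched without a model. In the code below, `toy_embed` is a stand-in I made up: it hashes character trigrams into a fixed-size vector, so it captures only spelling overlap, not the semantic similarity a real language-model embedding would. The search is brute force; at scale you would put the vectors in an approximate nearest-neighbor index instead.

```python
import math
import zlib

DIM = 1024  # arbitrary size for the toy vectors


def toy_embed(text):
    """Stand-in for a model embedding: hashed character trigrams, unit length."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity; inputs are assumed to be unit-length already."""
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbors(query_id, embeddings, k=2):
    """Brute-force top-k most similar records, excluding the query itself."""
    query = embeddings[query_id]
    scored = [(cosine(query, vec), rid)
              for rid, vec in embeddings.items() if rid != query_id]
    return sorted(scored, reverse=True)[:k]


records = {
    1: "Dr. Jane Smith, Cardiology, Boston",
    2: "Jane Smith MD cardiology boston ma",
    3: "Robert Chen, Dermatology, Seattle",
}
embeddings = {rid: toy_embed(text) for rid, text in records.items()}

for score, rid in nearest_neighbors(1, embeddings):
    print(f"record {rid}: similarity {score:.3f}")
```

Note what the output gives you: a similarity number and nothing else. That is the explainability gap described above. The score says the two records are close, but not which attribute made them close.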
The Confidence Threshold Trap
Whether you use deterministic rules or vector similarity, matching is never a binary "yes" or "no." It’s probabilistic. The system produces a confidence score for every potential pair.
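One common source of such a score is the Fellegi-Sunter model mentioned earlier. Each field contributes a log-likelihood ratio depending on whether it agrees, and the sum is converted to a probability. This is a simplified sketch: the m and u values and the prior below are invented for illustration, and the model assumes fields agree or disagree independently.

```python
import math

# Hypothetical parameters, invented for illustration:
#   m = P(field agrees | the two records are the same entity)
#   u = P(field agrees | the two records are different entities)
FIELD_PARAMS = {
    "last_name": {"m": 0.95, "u": 0.01},
    "zip": {"m": 0.90, "u": 0.05},
    "birth_year": {"m": 0.98, "u": 0.02},
}


def match_weight(agreements):
    """Sum of per-field log2 likelihood ratios (assumes field independence)."""
    total = 0.0
    for field, agrees in agreements.items():
        m = FIELD_PARAMS[field]["m"]
        u = FIELD_PARAMS[field]["u"]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def match_probability(weight, prior):
    """Posterior match probability from a match weight and a prior match rate."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


PRIOR = 1e-4  # assumed share of candidate pairs that are true matches

for agreements in (
    {"last_name": True, "zip": True, "birth_year": True},
    {"last_name": True, "zip": False, "birth_year": True},
    {"last_name": False, "zip": True, "birth_year": False},
):
    w = match_weight(agreements)
    print(agreements, f"weight={w:.2f}", f"p={match_probability(w, PRIOR):.4f}")
```

The appeal of this form is that every score decomposes into per-field contributions, so you can tell a reviewer exactly which agreements and disagreements moved the number.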
Your job as an architect is to decide the thresholds. You might set a high bar, say a score above 0.9, for automatic merges. You set a low bar, perhaps below 0.4, for automatic rejections. This sounds great until you are left with a mountain of pairs that fall in the middle. The "maybes."
This middle bucket is where projects stall. It’s too risky to merge them automatically—a false positive can be catastrophic—but it's also too risky to ignore them. This ambiguity can only be resolved by a human. This means you must build and maintain a "stewardship" application, a UI for data stewards to manually adjudicate matches. That software, and the operational team using it, is a permanent part of the system's total cost of ownership.
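The three-way split is simple to express in code, which is part of why its cost gets underestimated. A minimal sketch using the 0.9 and 0.4 thresholds from the text; the pair identifiers and scores are made up.

```python
from collections import Counter

AUTO_MERGE_ABOVE = 0.9   # from the text: scores above this merge automatically
AUTO_REJECT_BELOW = 0.4  # from the text: scores below this are rejected


def triage(score, merge_above=AUTO_MERGE_ABOVE, reject_below=AUTO_REJECT_BELOW):
    """Route a scored pair to automatic merge, automatic reject, or human review."""
    if score > merge_above:
        return "merge"
    if score < reject_below:
        return "reject"
    return "review"


# Made-up scored pairs: (record id pair, confidence score)
scored_pairs = [
    (("a1", "b7"), 0.97),
    (("a2", "b3"), 0.62),
    (("a4", "b9"), 0.12),
    (("a5", "b5"), 0.41),
]

queue_sizes = Counter(triage(score) for _, score in scored_pairs)
print(dict(queue_sizes))
```

Running this kind of count on a scored sample before launch gives a first estimate of review volume. Moving either threshold trades automatic errors against steward workload, so the size of the "review" queue at each setting is a number worth putting in front of whoever owns the budget.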
Pragmatic Architecture in Production
Building a robust identity resolution system is less about a single magic algorithm and more about defensive engineering. The real work is in managing complexity and cost.
First, treat your candidate generation strategy, whether it's rule-based blocking or vector search, as the primary cost-control lever. Second, invest heavily in your data cleansing and normalization pipeline; it's the foundation of everything that follows. Modern open-source tools like Splink, a library from the UK Ministry of Justice, provide excellent frameworks for implementing the classical, explainable approach at scale.
Finally, plan for the "maybe" bucket from day one. A human-in-the-loop workflow isn't a sign of failure; it's a sign of a mature, realistic system that understands the probabilistic nature of identity. The cost of those humans is just as real as the cost of your cloud servers.