Master data when everything keeps changing underneath you
Discover a resilient MDM architecture that thrives on change. Learn to decouple ingestion, use a flexible core/sidecar model, and run reconciliation continuously.
It's always a 3 AM pager alert that tells me the truth. The quarterly revenue report is broken. After an hour of frantic digging, I find it: a team refactoring their microservice renamed customer_id to customer_identifier. No announcement, no version change. Just a silent, breaking change that cascaded through a dozen systems, with my MDM platform taking the blame for "bad data."
I've been there more times than I can count. The central promise of Master Data Management is a single, stable source of truth. But the architecture I often see used to build it is incredibly brittle, assuming a world where source systems are static. That world does not exist.
The Fallacy of Static Sources
The classic approach to MDM is built on a fragile assumption: that I can define a perfect, canonical schema and force all incoming data to conform upon arrival. This leads to rigid ETL pipelines with complex transformations. When they encounter a field that has moved or changed types, the entire ingest process grinds to a halt. This treats change as an exception when, in my experience, change is the only constant.
An enterprise is a living system. A merger brings in an alien CRM. A product team launches a new service with its own user table. Designing for this reality means flipping the model. Instead of building walls to keep change out, I need to build systems that expect and absorb it gracefully. The goal is not to prevent change, but to contain its blast radius.
Land Raw Data First
The most effective pattern I've implemented for resilience is to separate receiving data from understanding it. Don't try to parse, validate, and transform a record in the same process that fetches it. This idea is a core tenet of the Medallion Architecture in modern data platforms, separating raw (Bronze) data from cleaned (Silver) data.
The process is simple:
- Land Raw Data. Create a landing zone—a data lake bucket or document store—where I dump the source data exactly as received. I only add metadata: a timestamp, the source name, and a batch ID. This step must be almost impossible to break.
- Parse and Normalize Separately. A second, independent process reads from this raw zone and applies parsing and validation. This follows what Martin Fowler calls the Tolerant Reader pattern: integrations read only the fields they need and tolerate changes in the rest of the data they consume.
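The landing step can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes a local directory stands in for the data lake bucket, and `land_raw`, the file layout, and the metadata field names are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def land_raw(payload: bytes, source: str, batch_id: str, landing_dir: Path) -> Path:
    """Store the payload byte-for-byte and record metadata next to it."""
    target = landing_dir / source / batch_id
    target.mkdir(parents=True, exist_ok=True)
    record_id = uuid.uuid4().hex

    # No parsing and no validation here: malformed or unexpected payloads
    # must still land successfully.
    raw_path = target / f"{record_id}.raw"
    raw_path.write_bytes(payload)

    # The only thing added is metadata (hypothetical field names).
    meta = {
        "source": source,
        "batch_id": batch_id,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(payload),
    }
    meta_path = target / f"{record_id}.meta.json"
    meta_path.write_text(json.dumps(meta), encoding="utf-8")
    return raw_path
```

Because the function never inspects the body, a renamed field or even truncated JSON cannot make it fail. Only storage problems can.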
When a source team renames customer_id, my ingest pipeline doesn't fail. It lands the new JSON payload successfully. It's the downstream parsing job that breaks. This is a crucial difference. I haven't lost data, I have a perfect record of what changed, and the failure is isolated to a single transformation job I can fix without stopping the flow of data from other sources.
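Here is one way the downstream parsing job might look, as a sketch of a tolerant reader. The alias table, `parse_customer`, and the `email` field are assumptions made for illustration. The point is that a rename is fixed by adding one alias, and a missing field produces an error that says what the payload actually contained.

```python
import json
from typing import Any, Optional

# Hypothetical alias table: canonical field -> names seen in source payloads.
FIELD_ALIASES = {
    "customer_id": ("customer_id", "customer_identifier"),
    "email": ("email", "email_address"),
}


class ParseError(ValueError):
    """Raised when a required field is missing under every known alias."""


def first_present(record: dict, names: tuple) -> Optional[Any]:
    """Return the first non-null value found under any of the given names."""
    for name in names:
        if record.get(name) is not None:
            return record[name]
    return None


def parse_customer(raw_body: str) -> dict:
    """Read only the fields this job needs and ignore everything else."""
    record = json.loads(raw_body)
    customer_id = first_present(record, FIELD_ALIASES["customer_id"])
    if customer_id is None:
        # Report what was actually received so the rename is easy to spot.
        raise ParseError(f"no customer id; fields seen: {sorted(record)}")
    return {
        "customer_id": str(customer_id),
        "email": first_present(record, FIELD_ALIASES["email"]),
    }
```

Unknown fields are ignored rather than rejected, so a source adding columns does not break the job. Only the loss of a required field does.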
A Lean Core with Sidecar Attributes
The next point of failure is often the master entity model itself. Attempting to create a single "Customer" table with 250 columns to represent every possible attribute from every source becomes a governance nightmare. Every change requires a schema migration on the most important table.
A more durable approach separates the stable from the volatile:
- The Core Entity: This table holds only fundamental, stable identifiers like a master entity_id and timestamps. It contains no business attributes.
- The Attribute Sidecar: All other data lives in a separate, key-value style table with a schema like (entity_id, attribute_name, attribute_value, source_system, effective_date).
When a new source provides 15 new customer preference fields, I don't need a stressful ALTER TABLE on my core model. I simply insert new rows into the sidecar. The trade-off is clear and must be acknowledged: this can make simple analytical queries painfully slow, because reading one entity as a wide row means pivoting many sidecar rows back into columns. The solution is to accept that this architecture requires a dedicated presentation layer, like materialized views, to serve wide, flat tables to BI tools. The gain in resilience is worth the cost of that extra transformation step.
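To make the core/sidecar split concrete, here is a small sketch using Python's built-in sqlite3 module. The table names, the sample attributes, and the `customer_flat` view are hypothetical. SQLite has no materialized views, so a plain view stands in for the presentation layer. In a warehouse this query would be materialized and refreshed on a schedule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE core_entity (
    entity_id  INTEGER PRIMARY KEY,
    created_at TEXT NOT NULL
);
CREATE TABLE attribute_sidecar (
    entity_id       INTEGER NOT NULL REFERENCES core_entity(entity_id),
    attribute_name  TEXT NOT NULL,
    attribute_value TEXT,
    source_system   TEXT NOT NULL,
    effective_date  TEXT NOT NULL
);
""")

conn.execute("INSERT INTO core_entity VALUES (456, '2024-01-01')")
# New attributes from new sources are plain inserts, not schema migrations.
conn.executemany(
    "INSERT INTO attribute_sidecar VALUES (?, ?, ?, ?, ?)",
    [
        (456, "email", "old@example.com", "crm", "2024-01-01"),
        (456, "email", "new@example.com", "billing", "2024-03-01"),
        (456, "newsletter_opt_in", "true", "prefs_service", "2024-02-10"),
    ],
)

# Presentation layer: keep the latest value per attribute, then pivot the
# rows into one wide row per entity for BI tools.
conn.execute("""
CREATE VIEW customer_flat AS
SELECT c.entity_id,
       MAX(CASE WHEN s.attribute_name = 'email'
                THEN s.attribute_value END) AS email,
       MAX(CASE WHEN s.attribute_name = 'newsletter_opt_in'
                THEN s.attribute_value END) AS newsletter_opt_in
FROM core_entity c
LEFT JOIN attribute_sidecar s
       ON s.entity_id = c.entity_id
      AND s.effective_date = (
            SELECT MAX(x.effective_date)
            FROM attribute_sidecar x
            WHERE x.entity_id = s.entity_id
              AND x.attribute_name = s.attribute_name)
GROUP BY c.entity_id
""")

row = conn.execute("SELECT * FROM customer_flat").fetchone()
```

The pivot has to name each attribute it exposes, so the wide table only grows when someone decides a new attribute is worth publishing. The core and sidecar tables never change shape.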
Reconciliation is a Continuous Process
Finally, I have to stop thinking of reconciliation as a monolithic batch job that creates a perfect "golden record." Real-world data is messy. I can't always be 100% certain that two records represent the same person. This problem of probabilistic matching has a deep history, formalized in the Fellegi-Sunter model back in 1969.
A modern engine should operate as a continuous service that produces a lineage graph with confidence scores. For example: "I believe source_A.record_123 and source_B.record_890 refer to master_entity_456 with 92% confidence." This lets me act on high-confidence matches immediately and flag low-confidence ones for human review. As new data arrives, the engine can re-evaluate linkages and revise its confidence, strengthening the entity over time without manual intervention.
The most valuable output of this system isn't just the final record. It's the audit trail explaining *why* the system believes certain records are the same. That lineage is the key to debugging and earning trust.
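A confidence-scored link with its evidence attached could look something like the sketch below. It uses Fellegi-Sunter style agreement weights, but the m and u probabilities, the prior, the field names, and the 0.9 threshold are all made-up values for illustration. A real engine would estimate them from data and would use fuzzier comparisons than exact equality.

```python
import math
from dataclasses import dataclass, field

# Assumed per-field probabilities (illustrative, not estimated from data):
# m = P(field agrees | same entity), u = P(field agrees | different entities).
FIELD_PARAMS = {
    "email": (0.95, 0.001),
    "last_name": (0.90, 0.05),
    "postcode": (0.85, 0.02),
}
PRIOR_MATCH = 0.01  # assumed share of candidate pairs that are true matches


@dataclass
class Link:
    source_a: str
    source_b: str
    confidence: float
    evidence: list = field(default_factory=list)  # the "why" behind the score


def score_pair(id_a: str, rec_a: dict, id_b: str, rec_b: dict) -> Link:
    """Combine per-field agreement weights into a match probability."""
    log_odds = math.log(PRIOR_MATCH / (1 - PRIOR_MATCH))
    evidence = []
    for name, (m, u) in FIELD_PARAMS.items():
        if name not in rec_a or name not in rec_b:
            continue  # a missing field contributes no evidence either way
        agrees = rec_a[name] == rec_b[name]
        weight = math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
        log_odds += weight
        evidence.append((name, "agree" if agrees else "disagree", round(weight, 2)))
    confidence = 1 / (1 + math.exp(-log_odds))
    return Link(id_a, id_b, confidence, evidence)


def route(link: Link, auto_threshold: float = 0.9) -> str:
    """Act on high-confidence links; send the rest to a reviewer."""
    return "auto_link" if link.confidence >= auto_threshold else "human_review"
```

Re-running `score_pair` when a source later supplies a postcode or email updates the confidence, and the `evidence` list records which fields moved the score and by how much. That list is the audit trail a reviewer reads when they ask why two records were linked.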
What to Remember at 3 AM
Building an MDM system that survives production requires designing for change from day one. It is less about achieving a perfect static state and more about creating a resilient, adaptable system that thrives in chaos.
- Ingest Raw, Parse Later. Your landing zone is your first line of defense against source volatility. Land data exactly as you get it.
- Keep Your Core Model Lean. A small table with stable identifiers is easier to govern than a monolithic one. Use a sidecar pattern for everything else, and plan for the query performance trade-off.
- Make Reconciliation Continuous. Treat matching as a service that produces auditable, confidence-scored links between records, not a black-box job.
- The Lineage Is the Product. The most valuable output is the audit trail explaining *why* the system believes what it does. This is how you build a trustworthy source of truth.