Slowly changing dimensions and the slowly changing me
Data
A classic data warehousing pattern, the Slowly Changing Dimension, holds surprising lessons for modern AI/data architecture and for a career built on evolution.
I remember the support ticket vividly. A sales report was "wrong." Revenue for California was down, but a major customer who we knew was a big spender there had vanished from the state's total. They hadn't churned. I dug into the database, and there they were, flagged as a Texas customer. "They moved last month," someone mentioned off-handedly. The system had simply updated their address. California's history was gone, overwritten in an instant.
That was my first real lesson in the physics of data. An update is not just a change; it's a potential erasure of the past. The tool to fight that entropy was something I’d only read about in textbooks, a pattern formalized in works like Ralph Kimball's The Data Warehouse Toolkit: the Slowly Changing Dimension.
The Anatomy of Change: Overwrite or Remember?
At its heart, a Slowly Changing Dimension is a formal strategy for dealing with entities that evolve. That customer who moved is the canonical example. The dimension is the `customers` table. The question is, what do you do about the new address?
The simplest answer, and the one our system defaulted to, is SCD Type 1: Overwrite. The new address replaces the old one. It’s clean, simple, and requires no schema changes. It is also a liar. It presents the present as if it were always the truth. For last quarter's report, our customer retroactively "lived" in Texas, breaking our historical view of the world.
The more honest answer is SCD Type 2: Add a New Row. Instead of overwriting, you expire the old record and insert a new one. You add metadata columns to your table: `start_date`, `end_date`, and `is_current_flag`. When the customer moves, their old California record gets an `end_date`. A new record is created for Texas with a new `start_date` and its `is_current_flag` set to true. History is preserved.
Suddenly, you can query the world as it was. You can ask, "What were our sales in California in Q1?" and get the right answer. You have traded simplicity for truth.
The Real Cost of History
Implementing SCD Type 2 feels like a profound victory for about a week. Then the production realities set in. First, your dimension tables get big. Really big. A customer who moves five times now has five rows. Storage isn't free, and neither are the ETL jobs that have to do this delicate insert-and-update dance. The logic is brittle; a bug in your change detection can corrupt your entire history.
Second, your queries become more complex. Every analyst, every BI tool, every downstream process must now be aware of the historical structure. Every `JOIN` on the customer table must include `WHERE is_current_flag = true` unless you are specifically doing point-in-time analysis. Forget that clause, and you get duplicate rows and wildly inflated numbers. It's an easy mistake to make at 2 AM.
This is the trade-off. Type 1 is easy but wrong. Type 2 is correct but costly. The fundamental tension is between the simplicity of the present and the messy, expensive truth of the past.
From Data Patterns to Career Patterns
Wrestling with these trade-offs early in my career changed how I saw system design. I stopped seeing databases as just a snapshot of the present. I started seeing them as archives of decisions, as histories of behavior. The default should be to preserve, to build systems that assume change is the norm.
This mindset leaked into how I viewed everything else. I began to see my own career in SCD terms. A simplistic, Type 1 view is my current title: "Enterprise Architect." It overwrites the past. It erases the web developer, the software engineer, the data lead who came before.
A Type 2 view is more honest. It sees a series of records: a `start_date` as a junior developer, an `end_date`. A new record as a team lead, then another. Each role is a row in the table of my professional life, with its own context. None of them are truly "expired." They compose the person who is here now. The historical records are the foundation, not the discarded drafts.
Does This Pattern Still Hold in the Age of Events?
It's fair to ask if this is just a relic of the data warehouse era. We have event sourcing and immutable logs now. We can just keep every event forever, right?
The purist view, championed in foundational texts like Martin Kleppmann's Designing Data-Intensive Applications, would argue that materializing stateful dimension tables is an anti-pattern. Just derive state on-the-fly from the immutable log, they'd say. But that ignores the brutal reality of production. Replaying a massive event log to figure out a customer's address for a single query is computationally expensive and slow. At 3am, when a critical dashboard is timing out, "derive on-the-fly" is a terrible answer.
The underlying pattern of separating current state from historical context for performance is more critical than ever. This history is also non-negotiable for AI. Imagine a churn prediction model. A key feature might be "days in current pricing tier." A Type 1 'overwrite' model only knows the customer is in the 'Pro' tier; it has no idea if they just upgraded yesterday or have been there for three years. That historical context, preserved by a Type 2 approach, is the signal. Without it, the model is flying blind.
Principles for Modeling Reality
The technical details of SCDs are secondary. The lasting lesson is philosophical. It's about acknowledging the dimension of time in our work and the systems we build.
- Default to preserving history. Storage is cheaper than regret. The data you discard today might contain the answer to a critical question you don't even know you have yet.
- Make the "current" view easy. While history is vital, the most common queries are for the present. Your architecture must serve this need efficiently. This is the purpose of the `is_current_flag`.
- Recognize that everything changes. Customers move. Products get renamed. People change roles. Build this assumption into the core of your systems, and you'll be ready for the inevitable.
That early, "wrong" sales report was a gift. It forced me to see that the cleanest data models are often the biggest liars. The world is messy, stateful, and constantly evolving. Our job is to reflect that truth, not pave over it with a simple `UPDATE` statement.