Learning to model data for questions nobody had asked yet

The request always sounds simple. “Can you pull last quarter’s sales to new customers in the Northeast, broken down by product category?” I remember hearing a version of that early in my career and thinking it would take an hour, maybe two.

Half a day later, I was still wrestling with a monster SQL query. A dozen joins, subqueries to define a "new" customer, and convoluted logic to handle address history. The query was slow, fragile, and if the business asked a slightly different question tomorrow, I’d have to start over. The problem wasn't the database; it was that we were storing data without preparing it to be questioned.

The Transactional Trap

Our application database was perfectly normalized—a textbook example designed to run the business efficiently. It was optimized for writing data with integrity, not for reading it with context. This is a classic problem in data architecture: the transactional model that excels at processing orders is a nightmare for analytics. Business context is spread across dozens of tables, and every analytical query must reassemble it on the fly.

You're forcing the query engine, and the analyst, to re-learn your business logic every single time a question is asked. This approach doesn't scale.

From Transactional to Analytical Model

The Durable Wisdom of Dimensional Modeling

The breakthrough comes when you stop modeling your application's objects and start modeling the business's events. This isn't a new idea; it’s the core principle of dimensional modeling, a discipline largely defined by Ralph Kimball and Margy Ross in their canonical book, The Data Warehouse Toolkit. Their patterns have survived decades of platform shifts for a reason: they work.

The most famous pattern is the star schema. It organizes data into two simple types:

Facts: The measurements of a business process. A sale, a shipment, a web visit. The fact table is long, narrow, and contains numeric values and keys.
Dimensions: The context that gives facts meaning. Who, what, when, where. The customer, the product, the date. Dimension tables are wide, descriptive, and surround the fact table like points on a star.

By transforming transactional data into a star schema, you perform the hard work of joining and context-building once, during the loading process. The subsequent analytical queries become simple, fast, and intuitive.

Architecture is About Trade-Offs

Kimball’s approach, often called "bottom-up," focuses on building dimensional models for specific business processes. It’s an earned opinion, but not the only one. The other giant of data warehousing, Bill Inmon, advocated for a "top-down" architecture. His view, detailed in his Corporate Information Factory framework, is to first build a centralized, normalized Enterprise Data Warehouse as the single source of truth, then spin off dimensional models for analysis.

Today, in a world of data lakehouses, some argue that schema-on-read tools like Spark or DuckDB make pre-built star schemas less necessary. You can model the logic in code, perhaps using a tool like dbt, without physically structuring the tables. But even then, the logical principles of dimensional modeling—separating facts from dimensions—remain the most effective way to think about and structure analytical code. The pattern endures even as the physical implementation changes.

Why This Old Pattern Matters for New AI

This brings us to the convergence of data, software, and AI. Why should we care about a 30-year-old modeling technique when building systems with LLMs? Because the "garbage in, garbage out" problem is amplified a thousand times with AI.

A well-modeled dimensional warehouse is an incredible asset for modern AI work:

High-Quality RAG: For Retrieval-Augmented Generation, you need clean, structured context. A star schema provides exactly that. Facts about an event and rich, descriptive dimensions are far more valuable context for an LLM than a messy collection of transactional tables.
Reliable Feature Engineering: Machine learning models depend on stable, well-defined features. A dimension table like DimCustomer is a natural feature store, providing pre-calculated, trustworthy attributes like customer lifetime value or days since last purchase.

Feeding an LLM agent raw transactional data is asking for unreliable, expensive, and slow results. Feeding it data from a dimensional model is giving it a curated, pre-digested view of the business—a foundation for agentic work that is both deterministic and trustworthy.

Modern Data & AI Architecture

Concrete Takeaways

Modeling for future questions isn't about predicting the future. It's about choosing a data structure that is inherently flexible and ready for interrogation by both humans and machines.

Separate OLTP from OLAP. Your application's database and your analytical system have different, conflicting jobs. Don't run analytics on your production database. A dedicated warehouse is non-negotiable.
Embrace durable patterns. The star schema is a battle-tested foundation. Understand the principles from the source—Kimball’s work—before reaching for the newest technology.
Understand the philosophies. Recognize the architectural trade-offs between different approaches, like Kimball vs. Inmon. Knowing the history informs better decisions for the future.
Build the AI foundation. A clean, dimensionally-modeled data warehouse is not a legacy concept. It is the single best way to prepare high-quality, context-rich data to reliably power your AI and agentic systems.