Embeddings vs the BI mindset: unlearning what I knew

For years, the most satisfying thing I could build was a perfectly balanced star schema. Facts in the middle, dimensions radiating out, every foreign key locked. It was a universe in a box. The breaking point came while designing a product discovery feature for an e-commerce catalog with thousands of SKUs. User queries like "a jacket for cold, wet weather" were failing against our rigid category tags, and I realized the box I'd built was also a cage.

The World in a Box: The BI Mindset

Anyone who came up through data warehousing knows the primary directive: clarification through categorization. Our job was to take the chaotic stream of business events and make it legible by forcing it into a pre-defined structure. We'd argue for weeks over the grain of a fact table. A product wasn't just a product; it belonged to a `Category`, a `Sub-Category`, and a `Brand` we defined in advance.

This mindset is powerful because it provides a shared, stable vocabulary. It powers the dashboards and financial forecasts that run a business. The goal is deterministic truth. Ask for "sales of winter coats in the Northeast for Q4," and you get one unambiguous number. Every time. But it's a fundamentally reductive process. We choose the dimensions that matter upfront, and in doing so, we discard immense nuance.

From Categorization to Proximity

Where the Rigid Boxes Break

The cracks in that worldview appeared with user intent. A person searching a website doesn't think in our `Sub-Category` codes. They type "something warm for a windy day on the trail," and my BI brain shorts out. How do you answer that with a star schema? The standard approach, a brittle `WHERE description LIKE '%warm%'` query, is a keyword-matching hack. It misses a fleece, a soft-shell jacket, or a thermal base layer whose description never uses that exact word.
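To make that failure concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `products` table, the product names, and the descriptions are all invented for illustration; they are not from the real catalog.

```python
import sqlite3

# Toy catalog: table name, products, and descriptions are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO products (id, name, description) VALUES (?, ?, ?)",
    [
        (1, "Down parka", "A warm insulated parka for winter."),
        (2, "Fleece pullover", "Cozy mid-layer that traps body heat."),
        (3, "Soft-shell jacket", "Wind-resistant shell for exposed ridgelines."),
        (4, "Thermal base layer", "Merino top that holds heat on cold hikes."),
    ],
)

# Keyword matching: only rows containing the literal substring "warm" qualify.
rows = conn.execute(
    "SELECT name FROM products WHERE description LIKE '%warm%' ORDER BY id"
).fetchall()
keyword_hits = [name for (name,) in rows]
print(keyword_hits)  # ['Down parka']
```

Three of the four items would keep you warm on a windy trail, but only the one whose copywriter happened to type "warm" comes back.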

The user's intent lives in the semantic relationships between "warm," "windy," and "trail." My first instinct was to fight it by adding more tags and attributes. But that's a losing battle of infinite maintenance. You can't pre-categorize the nuance of human language. The problem wasn't our data; it was the flawed assumption that meaning could be fully contained in a finite set of columns.

Entering Vector Space

My first real work with embeddings felt like a violation of principles. The technical shift, kicked into high gear by the Transformer architecture from Google's 2017 paper "Attention Is All You Need," is profound. You take a product description, use a model to turn it into a vector (a list of numbers, typically hundreds to a few thousand of them), and place it as a coordinate in a high-dimensional space.

Here’s the part I had to unlearn: there are no named dimensions. There isn't a `Color` axis or a `Fabric` axis. It's a pure mathematical representation of meaning. "Fleece pullover" is located near "warm mid-layer" not because a human tagged them, but because the model learned this relationship from the data. My job shifted from being a schema definer to a space navigator. The primitive operation was no longer `GROUP BY` but `find nearest neighbors`.
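A rough sketch of what "find nearest neighbors" means, in plain Python. The three-dimensional vectors below are hand-written stand-ins, not output from any real model, and the helper names (`cosine_similarity`, `nearest_neighbors`) are my own. Cosine similarity is one common choice of closeness measure; others exist.

```python
import math

# Hand-written 3-D vectors standing in for model output (illustrative only).
# Real embeddings have far more dimensions, and none of them are named.
catalog = {
    "fleece pullover": [0.9, 0.1, 0.2],
    "warm mid-layer": [0.8, 0.2, 0.3],
    "rain shell": [0.1, 0.9, 0.3],
    "beach sandals": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_vec, items, k=2):
    """Return the names of the k items most similar to query_vec."""
    ranked = sorted(
        items.items(),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


query = catalog["fleece pullover"]
others = {name: vec for name, vec in catalog.items() if name != "fleece pullover"}
print(nearest_neighbors(query, others, k=1))  # ['warm mid-layer']
```

Nothing in the code knows what a fleece is. The neighbor falls out of the geometry alone, which is the whole point.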

The Hybrid Architecture That Actually Works

The hype cycle declares the old way dead. That's a mistake. The durable architectural pattern that holds up in production is a hybrid where the probabilistic world of vectors and the deterministic world of SQL collaborate. Vector database vendors document closely related patterns under names like "filtered search" (nearest-neighbor search constrained by structured metadata) and "hybrid search" (a term that often also covers blending vector and keyword scoring); Pinecone's posts on the topic are one example.

In this model, the systems have distinct jobs:

The Vector System handles discovery. It takes a fuzzy query, embeds it, and performs an Approximate Nearest Neighbor (ANN) search to return a candidate set of IDs based on semantic relevance. It answers, "What is like this?"
The Relational System handles retrieval and filtering. It takes the candidate IDs and joins them against a traditional database to fetch structured facts: price, inventory, user permissions. It answers, "What are the concrete facts about these specific items?"

Some argue for all-in-one multi-model databases, but in my experience, separating the mature, deterministic workload from the rapidly evolving vector workload provides more operational stability. The boring parts matter. You have to manage data consistency between the two systems, and you have to watch latency: an ANN search in a library like Faiss followed by a SQL `IN` clause with 200 IDs has a different performance profile than a single complex SQL query. That trade-off is where the real work lies.
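The handoff between the two systems can be sketched in a few lines. This is a toy under stated assumptions, not a production design: the brute-force `vector_search` function stands in for a real ANN index, an in-memory SQLite table stands in for the relational store, and every ID, name, price, and vector is invented.

```python
import math
import sqlite3

# Vector side: toy 2-D embeddings keyed by product ID (values invented).
embeddings = {
    101: [0.9, 0.1],
    102: [0.8, 0.3],
    103: [0.1, 0.9],
    104: [0.7, 0.2],
}


def vector_search(query_vec, k):
    """Brute-force stand-in for an ANN index: IDs of the k closest vectors."""
    ranked = sorted(embeddings, key=lambda pid: math.dist(query_vec, embeddings[pid]))
    return ranked[:k]


# Relational side: structured facts live in SQL (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL, in_stock INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (101, "Fleece pullover", 60.0, 1),
        (102, "Soft-shell jacket", 140.0, 0),
        (103, "Beach sandals", 25.0, 1),
        (104, "Thermal base layer", 45.0, 1),
    ],
)


def hybrid_search(query_vec, k=3):
    """Discover candidates by similarity, then fetch and filter facts in SQL."""
    candidate_ids = vector_search(query_vec, k)
    if not candidate_ids:
        return []
    placeholders = ",".join("?" for _ in candidate_ids)
    rows = conn.execute(
        f"SELECT id, name, price FROM products "
        f"WHERE id IN ({placeholders}) AND in_stock = 1",
        candidate_ids,
    ).fetchall()
    # IN does not preserve order, so restore the similarity ranking.
    rank = {pid: position for position, pid in enumerate(candidate_ids)}
    return sorted(rows, key=lambda row: rank[row[0]])


results = hybrid_search([1.0, 0.1])
print(results)  # [(101, 'Fleece pullover', 60.0), (104, 'Thermal base layer', 45.0)]
```

Two details tend to bite in practice. The SQL filter runs after the candidate cut, so an out-of-stock item (102 here) consumes a candidate slot and the final list can come back shorter than `k`. And the ranking has to be reapplied on the way out, because the relational side has no notion of similarity order.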

What This Means in Practice

This mental shift changed how I design systems. It's less about a perfect, all-encompassing schema upfront and more about a clean handoff between these two worlds. The takeaways are now hard-coded in my approach.

First, stop adding `tag_1`, `tag_2`, `tag_3` columns to pre-categorize nuance. Invest that effort in a single, high-quality `description_for_embedding` field and let a model find the latent structure.

Second, learn the new killer clause. Your most powerful tool for discovery is no longer `WHERE category = 'X'`, but a clause that looks like `ORDER BY L2_distance(embedding, :query_embedding) LIMIT 200`.
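To show that clause actually executing, here is a small sketch that registers a Python function as a SQL function in SQLite. Everything here is an assumption for demonstration purposes: SQLite has no built-in `L2_distance`, the `items` table is hypothetical, and storing embeddings as JSON text is a convenience, not a recommendation. This version also scans every row; extensions such as pgvector expose a comparable distance operator that can use an index.

```python
import json
import math
import sqlite3


def l2_distance(a_json, b_json):
    """Euclidean distance between two vectors stored as JSON arrays."""
    return math.dist(json.loads(a_json), json.loads(b_json))


conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name used in the clause.
conn.create_function("L2_distance", 2, l2_distance)

conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, embedding TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [
        (1, json.dumps([0.0, 0.0])),
        (2, json.dumps([1.0, 1.0])),
        (3, json.dumps([0.2, 0.1])),
    ],
)

# Same shape as the clause in the text, with LIMIT 2 instead of 200.
rows = conn.execute(
    "SELECT id FROM items "
    "ORDER BY L2_distance(embedding, :query_embedding) LIMIT 2",
    {"query_embedding": json.dumps([0.1, 0.1])},
).fetchall()
nearest_ids = [row[0] for row in rows]
print(nearest_ids)  # [3, 1]
```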

[Figure: Hybrid Search and Retrieval Architecture]

Finally, build hybrid systems. Use vector search for the fuzzy front-end of discovery and recommendation, but use a rock-solid relational database for the deterministic back-end of facts and filtering. I still value a clean schema, but I no longer believe it should contain the whole world. The BI mindset gave me a respect for verifiable truth. Embeddings taught me that truth is often found in the messy, probabilistic spaces between the boxes we draw.