
Building a warehouse before I knew what a warehouse was

Data

A 25-year enterprise architect's honest retrospective on building an accidental data warehouse and the production failures that forced a move to intentional design.

The alert on my pager—yes, a real pager—went off at 2:17 AM. It was not the production application database, which was humming along fine. It was the "reporting server," a machine that was, in theory, completely non-critical. Except the CFO was in another timezone, preparing for a board meeting, and his dashboard was showing all zeros. That is when I realized the simple collection of scripts and replicated tables I had built had quietly become the company’s source of truth.

We had built a data warehouse without ever intending to. And it was starting to fall apart.

It Started with a Replica

The need was simple. The engineering team had to track user activity without hammering the main transactional database. We wanted a single weekly email summary: new signups, daily active users, top features used. The path of least resistance was to spin up a small PostgreSQL replica and run our analytical queries there.

The first version was a nightly pg_dump and restore. It was crude, but it worked. The data was 24 hours stale, which was fine for a weekly report. We had an isolated copy, our queries did not impact production users, and the report went out every Monday. A success.
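For readers who have not run this kind of refresh, a nightly dump-and-restore might be sketched roughly as follows. This is an illustration under stated assumptions, not the original script: the host names, database name, and dump path are hypothetical, and the function only builds the `pg_dump` and `pg_restore` command lines so that a scheduler such as cron could run them.

```python
from typing import List, Tuple


def build_refresh_commands(
    source_host: str,
    replica_host: str,
    dbname: str,
    dump_path: str,
) -> Tuple[List[str], List[str]]:
    """Return (dump_cmd, restore_cmd) for a full nightly refresh.

    Every argument is a placeholder supplied by the caller.
    """
    dump_cmd = [
        "pg_dump",
        "--host", source_host,
        "--format=custom",  # archive format that pg_restore can read
        "--file", dump_path,
        dbname,
    ]
    restore_cmd = [
        "pg_restore",
        "--host", replica_host,
        "--dbname", dbname,
        "--clean", "--if-exists",  # drop existing objects before recreating them
        "--no-owner",
        dump_path,
    ]
    return dump_cmd, restore_cmd


if __name__ == "__main__":
    # Hypothetical hosts and paths, for illustration only.
    dump_cmd, restore_cmd = build_refresh_commands(
        "prod-db.internal", "reporting-db.internal", "app", "/tmp/app.dump"
    )
    print(" ".join(dump_cmd))
    print(" ".join(restore_cmd))
```

Because this is a full reload rather than incremental replication, the copy is always as stale as the interval between runs, which matches the 24-hour lag described above.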

But data, once centralized, develops its own gravity. Other teams heard we had a place with all the user activity. "Could you just add a column for the acquisition source?" Then, "Can we join that against feature flag data?" Each request was logical and small. Soon, we were running Python scripts via cron to pull in data from third-party APIs. The "reporting replica" was no longer a replica. It was an integration hub.

Figure: The Accidental Warehouse Evolution (Simple Replica → Ad-Hoc Scripts → Fragile Hub)

Gravity Without a Blueprint

Our architecture did not evolve through design; it was a series of patches. We were feeling the pull of data gravity, a force that architects like Bill Inmon and Ralph Kimball spent their careers trying to channel. Inmon advocated for a highly structured, top-down Enterprise Data Warehouse. Kimball championed more agile, business-focused dimensional models, a concept Martin Fowler explains well in his overview of Dimensional Modeling. We did neither. We had no blueprint, just a growing pile of scripts with hardcoded credentials and minimal error handling.

Each new data source was another cron job. Each new question from the business was another complex query. Performance degraded, slowly at first. A query that once took five seconds now took thirty. The dashboard that once loaded instantly now showed a spinner for a full minute. It was death by a thousand cuts.

The Catastrophe of Nested Views

The real trouble began when we tried to create abstractions with SQL VIEWs. A view to clean up user data. A view to join users with their subscription status. Then, a third view that joined the previous two. In our minds, this was smart. In reality, it was technical debt compounding with interest.

The performance cliff arrived suddenly. A slight increase in data volume caused the database's query planner, blinded by the opaque layers of views, to switch strategy. It abandoned a plan built on a perfectly good index scan and fell back to a nested-loop join over a multi-million-row table it mistakenly estimated to be small. A query that used to take five minutes never finished. We had built a system that was impossible for humans, or the database itself, to reason about.

When Implicit Trust Breaks

The most dangerous phase for an organic system is when it becomes important but is still treated as a side project. The "reporting DB" was now the bedrock for financial projections, but it was managed with no operational rigor. There were no data contracts—no simple agreements between the app team and the data consumers that a schema was a shared dependency.

An application developer, working on a feature, would rename a column in the production database. They would update their application code, run their tests, and ship it. They had no idea they were breaking the entire company's revenue dashboard, because our fragile ingestion scripts would fail silently or, worse, start inserting NULLs. The system we built to create clarity was now creating confusion.
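A data contract does not have to be elaborate to catch this class of failure. The sketch below is one possible shape, not what we actually ran: the `USERS_CONTRACT` column set, the `ContractViolation` exception, and both helper functions are hypothetical names invented for illustration. The idea is simply that an ingestion script should stop loudly when a contracted column disappears, rather than quietly writing NULLs.

```python
from typing import Any, Dict, FrozenSet, Iterable


class ContractViolation(Exception):
    """Raised when source data no longer matches the agreed schema."""


# Hypothetical contract: the columns downstream dashboards depend on.
USERS_CONTRACT: FrozenSet[str] = frozenset({"id", "email", "plan", "signup_date"})


def check_columns(table: str, actual: Iterable[str], required: FrozenSet[str]) -> None:
    """Fail loudly if any contracted column is missing from the source."""
    missing = required - set(actual)
    if missing:
        raise ContractViolation(f"{table}: missing columns {sorted(missing)}")


def extract_row(row: Dict[str, Any], required: FrozenSet[str]) -> Dict[str, Any]:
    """Copy only contracted fields; a missing key raises instead of becoming NULL."""
    try:
        return {col: row[col] for col in sorted(required)}
    except KeyError as exc:
        raise ContractViolation(f"row is missing column {exc}") from exc


if __name__ == "__main__":
    # The app team renamed "plan" to "tier" without telling anyone.
    try:
        check_columns("users", ["id", "email", "tier", "signup_date"], USERS_CONTRACT)
    except ContractViolation as err:
        print(err)  # users: missing columns ['plan']
```

The contrast with the failure described above is the use of `row[col]` instead of something like `row.get(col)`: the lenient lookup is exactly what turns a renamed column into a stream of NULLs.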

Figure: An Intentional Data and AI Architecture
  Sources: App Databases, Event Streams, Third-Party APIs, File Storage
  Integration and Storage: Ingestion Pipelines, Raw Data Lake, Modeled Warehouse
  Processing and Intelligence: Deterministic SQL, Batch Jobs, LLM Agents, Model Training
  Serving Layer: BI Dashboards, APIs, Reverse ETL

Lessons Forged in 2 AM Failures

That 2 AM pager incident was the forcing function. We had to admit what we had built and give it the engineering discipline it required. We started materializing data into clean, well-defined tables instead of relying on nested views. We built data quality tests. We established ownership. We started, finally, to build a real data warehouse. The experience taught me a few durable lessons.

  • Acknowledge the pull towards the center. The gravity we felt is what formal architecture tries to tame. Our mistake was not centralization itself, but centralization without a blueprint.
  • Views are for consumption, not construction. They should be the last step, a convenience for an end-user, not the foundational pillars of a data model. Build on tables and explicit transformations.
  • Your social contract is your real SLA. When people outside engineering make decisions based on your data, it is a product. That transition requires explicit commitments to reliability, quality, and communication.
  • Question the center itself. Today, the conversation has evolved. Thinkers like Zhamak Dehghani, in her work on Data Mesh, question the very premise of a single, centralized warehouse. Her argument is that the organizational scaling problems we experienced are inevitable, and a decentralized approach may be the only sustainable path forward.
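
The second lesson, together with the data quality tests mentioned earlier, can be sketched in a few lines. This is a minimal illustration under loud assumptions: SQLite stands in for PostgreSQL so the example is self-contained, and the `users`, `subscriptions`, and `user_subscriptions` tables, their columns, and the three checks are all hypothetical. The join that used to live in a stack of views is materialized as a real table by an explicit step, and each check is a query that counts offending rows.

```python
import sqlite3
from typing import List


def build_user_subscriptions(conn: sqlite3.Connection) -> None:
    """Materialize the users/subscriptions join as a table, not a view."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS user_subscriptions;
        CREATE TABLE user_subscriptions AS
        SELECT u.id AS user_id, u.email AS email, s.status AS status
        FROM users AS u
        JOIN subscriptions AS s ON s.user_id = u.id;
        """
    )


# Each check counts offending rows; zero means the check passes.
QUALITY_CHECKS = {
    "table is not empty":
        "SELECT CASE WHEN COUNT(*) = 0 THEN 1 ELSE 0 END FROM user_subscriptions",
    "email is never NULL":
        "SELECT COUNT(*) FROM user_subscriptions WHERE email IS NULL",
    "one row per user":
        "SELECT COUNT(*) FROM (SELECT user_id FROM user_subscriptions "
        "GROUP BY user_id HAVING COUNT(*) > 1)",
}


def failed_checks(conn: sqlite3.Connection) -> List[str]:
    """Return the names of all checks that found at least one bad row."""
    return [
        name
        for name, sql in QUALITY_CHECKS.items()
        if conn.execute(sql).fetchone()[0] > 0
    ]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE subscriptions (user_id INTEGER, status TEXT);
        INSERT INTO users VALUES (1, 'a@example.com'), (2, NULL);
        INSERT INTO subscriptions VALUES (1, 'active'), (2, 'trial');
        """
    )
    build_user_subscriptions(conn)
    print(failed_checks(conn))  # ['email is never NULL']
```

A pipeline built this way can refuse to publish a refreshed table when `failed_checks` returns anything, so a dashboard keeps showing yesterday's numbers instead of today's zeros.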
Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.