Why I started treating data like a product

The pager went off at 2:47 AM. A critical executive dashboard, the one everyone used for the weekly business review, was showing nonsense: revenue had supposedly dropped 90% overnight. It hadn't. After a frantic hour of digging, we found the cause: an application team, two services upstream, had renamed a column in their production database from order_value to order_total_usd during a routine deployment. No one told us. Our nightly ETL job didn't fail; it just started ingesting nulls, and our dashboard silently broke.

That was the moment I realized our entire approach was fundamentally wrong. We were treating data as a byproduct, a trail of exhaust from the "real" work of building software. It was a problem of incentives, not technology.

The Incentive Gap: Why Central Pipelines Break

The old model, where my team would pull data from other teams' databases, wasn't born from incompetence. It was a product of local optimization. An application team's job was to make their application work. Their priorities were features, uptime, and performance for their users. The data sitting in their database was a side effect, and maintaining a stable, implicit API for a downstream analytics team they rarely spoke to wasn't on their roadmap.

That arrangement creates a deep, invisible dependency that is guaranteed to fail eventually. I've seen the consequences a dozen times. One team owns the data lake but has zero control over the sources. We’d spend weeks building a clean customer_360 table, only to have it silently poisoned six months later when an upstream team started hashing PII for compliance reasons without telling anyone. The result wasn't just a "data swamp"; it was a graveyard of abandoned projects and eroded trust, all because of broken ownership.

A Better Model: Data as a Product

The antidote is to treat data as a product. This idea, a cornerstone of the Data Mesh approach articulated by Zhamak Dehghani, reframes the entire problem. A product has a purpose. It has a provider and a consumer. It has an interface, a version, and a quality guarantee.

When you treat a dataset as a product, you are forced to be specific:

Who is the owner? The team that creates the data is now responsible for it as a first-class deliverable.
Who are the users? Analysts running queries? An ML model in training? A low-latency service? The user defines the requirements.
What is the interface? The "API" for data isn't just a REST endpoint. It's a versioned schema, a delivery cadence, and a guaranteed level of data quality.
What is the SLA? We can now talk about data freshness, availability, and accuracy as contractual obligations.
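
To make those four questions concrete, here is a minimal sketch of how the answers might be written down in code. Everything in it is an assumption for illustration: the DataProductContract class, its field names, the team and consumer names, and the SLA numbers are hypothetical, not taken from a real system or library.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DataProductContract:
    """Hypothetical descriptor for a data product; all fields are illustrative."""
    name: str
    owner: str                    # team accountable for the data
    consumers: tuple[str, ...]    # known users, who drive the requirements
    schema_version: str           # version of the published interface
    schema: dict[str, str]        # column name -> logical type
    delivery_cadence: timedelta   # how often new data is published
    max_staleness: timedelta      # freshness SLA
    min_completeness: float       # share of rows with every required field set


# Illustrative values only; the numbers are assumptions, not recommendations.
order_events = DataProductContract(
    name="OrderEvents",
    owner="orders-team",
    consumers=("weekly-business-review", "fraud-detection"),
    schema_version="1.0.0",
    schema={
        "order_id": "string",
        "order_value": "decimal",
        "created_at": "timestamp",
    },
    delivery_cadence=timedelta(hours=24),
    max_staleness=timedelta(hours=26),
    min_completeness=0.999,
)
```

The point of the sketch is not the class itself but that each question now has a named, reviewable answer that lives in version control next to the producing service.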

In this world, the team that renamed order_value wouldn't just be changing their application's code. They'd be releasing a new, breaking version of their OrderEvents data product. It would be a deliberate act, managed through a proper deprecation cycle, just like any other API they maintain.
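
One way to make "breaking version" mechanical is a compatibility check that runs before a schema change ships. The sketch below is an assumption about how such a check could look, not a description of any particular schema registry: the function names and the rule set (a removed column or a changed type is breaking, an added column is not) are hypothetical, and the version numbers are made up.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes that would break a consumer reading the old schema."""
    problems = []
    for column, col_type in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
        elif new[column] != col_type:
            problems.append(f"type changed: {column} {col_type} -> {new[column]}")
    return problems


def next_version(current: str, old: dict[str, str], new: dict[str, str]) -> str:
    """Pick the next semantic version based on the kind of schema change."""
    major, minor, _patch = (int(part) for part in current.split("."))
    if breaking_changes(old, new):
        return f"{major + 1}.0.0"      # consumers must migrate
    if new.keys() - old.keys():
        return f"{major}.{minor + 1}.0"  # additive, backward compatible
    return current                      # schema unchanged


# The rename from the incident shows up as a removal plus an addition.
v1 = {"order_id": "string", "order_value": "decimal"}
v2 = {"order_id": "string", "order_total_usd": "decimal"}

print(breaking_changes(v1, v2))       # ['removed column: order_value']
print(next_version("1.4.2", v1, v2))  # 2.0.0
```

A check like this turns the rename into a visible decision: the deployment either bumps the major version and starts a deprecation cycle, or it does not ship.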

The Real Costs and the Real Payoff

This approach isn't free. It asks application teams, whose roadmaps are already full, to take on the new responsibility of being a data provider. It requires investment in shared infrastructure like an event bus and a schema registry. The initial cost in time, money, and organizational friction is real and shouldn't be hand-waved away.

But the cost of not doing it is the silent, accumulating technical debt of a fragile system. That fragility is unacceptable now that software, data, and AI are converging. You cannot build a trustworthy autonomous agent or a critical deterministic automation on top of data that is an afterthought. When an LLM-based agent is asked to "summarize the performance of our top five products," its output is only as reliable as the data it consumes. If that data is a product with clear lineage and quality metrics, the agent has a chance of being useful. If it comes from a swamp of unowned, best-effort jobs, the agent is just confidently hallucinating from garbage.

Making It Real: Start with One Contract

The hardest part of this shift is organizational. My first attempt to implement this was met with resistance. The 'Orders' team saw creating a data product as "analytics busywork" that would slow down their feature releases. They weren't wrong; it was more work for them, with no immediate payoff for their own service.

We didn't win by arguing about architecture. We won by showing that their new, clean OrderEvents data product would let them build a fraud-detection feature they wanted, faster than they could by querying other teams' databases. We made the producer the first and most powerful consumer of their own product. That success became the blueprint.

This isn't an all-or-nothing transformation. It's a change in perspective you can apply incrementally.

Start with the consumer. Find a single, high-value use case and identify the critical data it depends on.
Assign a clear owner. The team closest to the data's source must own its delivery. They have context no one else does.
Define an explicit contract. Write down the schema, the delivery schedule, and the quality checks. Make the implicit promises explicit. This contract is your new API.
Instrument and measure. Track freshness, uptime, and schema adherence. Make the health of your data products as visible as the health of your microservices.
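
The last two steps can be sketched as a small health check that runs on every delivered batch. This is a minimal illustration under stated assumptions: the required columns, the 26-hour staleness limit, the adherence threshold, and the check_batch function are all hypothetical, and a real setup would publish these metrics to whatever monitoring the team already uses.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract terms, chosen only for illustration.
REQUIRED_COLUMNS = {"order_id", "order_value", "created_at"}
MAX_STALENESS = timedelta(hours=26)
MIN_ADHERENCE = 0.999


def check_batch(rows: list[dict], now: datetime) -> dict:
    """Compute schema adherence and freshness for one delivered batch."""
    if not rows:
        return {"schema_adherence": 0.0, "fresh": False, "healthy": False}

    # A row conforms only if every required column is present and non-null.
    conforming = sum(
        1 for row in rows
        if all(row.get(col) is not None for col in REQUIRED_COLUMNS)
    )
    adherence = conforming / len(rows)

    # Freshness is judged by the newest event timestamp in the batch.
    newest = max(
        (row["created_at"] for row in rows if row.get("created_at") is not None),
        default=None,
    )
    fresh = newest is not None and now - newest <= MAX_STALENESS

    return {
        "schema_adherence": adherence,
        "fresh": fresh,
        "healthy": fresh and adherence >= MIN_ADHERENCE,
    }


now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
recent = datetime(2024, 1, 2, 1, 0, tzinfo=timezone.utc)
batch = [
    {"order_id": "o-1", "order_value": 10.0, "created_at": recent},
    # A row written after an unannounced rename: order_value is missing.
    {"order_id": "o-2", "order_total_usd": 20.0, "created_at": recent},
]
report = check_batch(batch, now)
print(report)  # {'schema_adherence': 0.5, 'fresh': True, 'healthy': False}
```

With a check like this in place, the renamed column fails loudly at delivery time instead of turning into nulls on a dashboard.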

Viewing data as exhaust leads to systems that break at 3 AM. Treating it as a product is the first step toward building something that lasts.