ChatGPT just launched — why my data foundation means I'm calm
AI
When a new LLM like ChatGPT launches, the frantic scramble to build demos creates technical debt. A calm, ready response comes from a robust data foundation.
The internal chat notifications began to light up. A few pings at first, then a steady stream. A senior leader’s email arrived with a one-word subject: “ChatGPT?” It seemed the entire tech world was having a simultaneous, urgent conversation about what this new capability meant, and how quickly we had to react.
My honest feeling in that moment was calm. Not because I wasn't impressed—the technology was clearly a step change—but because the foundation we’d spent years building was designed for this. When your data house is in order, a powerful new technology isn’t a crisis. It's a new component you can wire into a working system.
The Scramble vs. The Deliberate Integration
In my experience, organizations react to these shifts in one of two ways. The most common is The Scramble. It's a frantic, top-down push to "do something with AI." Engineers are asked to build a demo, fast. They grab whatever data is available—stale database replicas, messy CSVs, an undocumented API. They build a prototype that looks incredible for ten minutes.
The problem is that this demo is brittle. It works with one data source, but when you ask it to correlate that data with, say, customer support tickets, the whole thing collapses because there's no shared identity key. When the business asks to put it in production, the true cost appears: a year of cleanup just to get back to zero.
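To illustrate the missing identity key, here is a minimal sketch in Python. The record shapes, field names, and the identity map are all hypothetical and chosen only for illustration. Usage data is keyed by an account ID, tickets are keyed by email, and the naive join quietly returns nothing until a maintained mapping connects the two.

```python
def join_on(left, right, left_key, right_key):
    """Inner-join two lists of dicts where left[left_key] == right[right_key]."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in index.get(lrow[left_key], [])
    ]


# Hypothetical sample records; the field names are assumptions.
USAGE = [{"account_id": "A-17", "weekly_sessions": 42}]
TICKETS = [{"email": "pat@example.com", "subject": "Export fails"}]

# The demo's situation: no shared key, so the join silently matches nothing.
naive = join_on(USAGE, TICKETS, "account_id", "email")

# With a maintained identity map (an assumed data product), the join works.
IDENTITY = [{"account_id": "A-17", "email": "pat@example.com"}]
usage_with_email = join_on(USAGE, IDENTITY, "account_id", "account_id")
resolved = join_on(usage_with_email, TICKETS, "email", "email")
```

The failure mode worth noticing is that the naive join does not raise an error. It returns an empty result, which is exactly the kind of problem a ten-minute demo never surfaces.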
Now, a quick, throwaway prototype can be a valid tactic to unlock a budget. But it's a terrible strategy for building a capability. The alternative is The Deliberate Integration. This approach treats a new LLM not as a revolution, but as a powerful new consumer of a well-managed data platform. The question shifts from "what data can we find?" to "which of our certified data products should we expose to this new endpoint?"
The Foundation That Enables Calm
For years, the work of building a proper data foundation can feel like thankless plumbing. You advocate for a data catalog. You enforce schema and quality standards, the kind of agreements we now commonly call Data Contracts. You invest in lineage tooling to trace data from its origin. It’s the opposite of hype; it’s the disciplined, boring work of craftsmanship.
This approach aligns with the principle of treating data as a first-class product, an idea articulated well by Zhamak Dehghani in her work on Data Mesh. When the AI wave hits, this "boring" work becomes a significant strategic advantage. A product-oriented data estate gives you clarity about what data exists, control over who consumes it, and trust in its quality.
When someone asks for a chatbot to answer questions about product usage, we don't hunt for data. We look in the catalog, find the certified "customer product interactions" data product, and know its owner, freshness, and schema. The discovery phase takes minutes, not months.
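As a rough sketch of what that lookup could look like, assume a catalog that exposes entries with an owner, a certification flag, a last-refresh timestamp, and a schema. The `DataProduct` class, the `find_certified` helper, and the sample entries below are hypothetical and not the API of any particular catalog tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical catalog entry; the fields are illustrative assumptions."""
    name: str
    owner: str
    certified: bool
    last_refreshed: datetime
    schema: Dict[str, str]


def find_certified(
    catalog: List[DataProduct], name: str, max_age: timedelta, now: datetime
) -> Optional[DataProduct]:
    """Return the named product only if it is certified and fresh enough."""
    for product in catalog:
        if product.name != name:
            continue
        is_fresh = now - product.last_refreshed <= max_age
        if product.certified and is_fresh:
            return product
    return None


# Fixed clock and made-up entries so the example is reproducible.
NOW = datetime(2023, 1, 10, 12, 0, tzinfo=timezone.utc)
CATALOG = [
    DataProduct(
        name="customer_product_interactions",
        owner="product-analytics",
        certified=True,
        last_refreshed=NOW - timedelta(hours=3),
        schema={"customer_id": "string", "event": "string", "ts": "timestamp"},
    ),
    DataProduct(
        name="legacy_usage_dump",
        owner="unknown",
        certified=False,
        last_refreshed=NOW - timedelta(days=400),
        schema={},
    ),
]

product = find_certified(
    CATALOG, "customer_product_interactions", timedelta(days=1), NOW
)
```

The point of the sketch is that ownership, freshness, and schema are answers you read off a record, not facts you reconstruct by asking around.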
How a Data Foundation Grounds an LLM
Let's make this concrete. A request comes in to build an internal agent to help engineers troubleshoot production alerts. The Scramble approach would be to dump ten years of wiki runbooks and a vast archive of internal chat logs into a vector database and hope for the best. The result would be an agent that confidently suggests using deprecated libraries and points to engineers who left the company years ago.
The deliberate approach is different. We define a new "Certified Runbooks" data product. The work involves identifying authoritative sources, establishing ownership, and creating a deterministic pipeline that scrubs secrets and structures the content. This curated output is then registered in the data catalog with clear lineage.
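A minimal sketch of such a pipeline might look like the following. The source names, record fields, and the two redaction patterns are assumptions for illustration only. A production pipeline would rely on a vetted secret scanner rather than a pair of regular expressions.

```python
import re

# Illustrative patterns only; these are assumptions, not a complete scanner.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"\b(password|token|api_key)\s*[=:]\s*\S+", re.IGNORECASE),
]


def scrub_secrets(text):
    """Replace anything matching a known secret pattern with a marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def build_certified_runbooks(raw_docs, authoritative_sources):
    """Keep only documents from authoritative sources, scrubbed and structured."""
    records = []
    # Sorting by id keeps the output order stable from run to run.
    for doc in sorted(raw_docs, key=lambda d: d["id"]):
        if doc["source"] not in authoritative_sources:
            continue
        records.append({
            "id": doc["id"],
            "owner": doc["owner"],
            "source": doc["source"],  # retained so lineage can be traced
            "body": scrub_secrets(doc["body"]),
        })
    return records


# Hypothetical inputs: one chat message and one wiki runbook.
RAW_DOCS = [
    {"id": "rb-2", "source": "chat-archive", "owner": "unknown",
     "body": "try restarting it, worked for me once"},
    {"id": "rb-1", "source": "sre-wiki", "owner": "sre-team",
     "body": "Restart the worker. password=hunter2 then retry."},
]

certified = build_certified_runbooks(RAW_DOCS, {"sre-wiki"})
```

Here the chat message is dropped because its source is not on the authoritative list, and the wiki runbook survives with its credential replaced by a marker.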
Only then do we point our system at it. This pattern is known as Retrieval-Augmented Generation, or RAG: at query time the system retrieves the most relevant passages from a small, high-quality library and passes them to the model as context, so the answer is grounded in that library rather than in whatever the model absorbed during training. The approach was named in a 2020 paper led by researchers at what was then Facebook AI Research. The core work isn't in the AI model; it's in the data engineering. The result is a tool that is not just powerful, but reliable and safe.
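To make the retrieval step concrete, here is a deliberately simplified sketch. It uses keyword overlap as a stand-in for the embedding search a real system would likely use, and it stops at building the prompt instead of calling any model. The runbook records and function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, runbooks, top_k=2):
    """Rank runbooks by token overlap with the question (a toy retriever)."""
    question_tokens = tokenize(question)
    scored = []
    for runbook in runbooks:
        overlap = len(question_tokens & tokenize(runbook["body"]))
        if overlap:
            scored.append((overlap, runbook))
    # Highest overlap first; ties are broken by id so results are stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]["id"]))
    return [runbook for _, runbook in scored[:top_k]]


def build_prompt(question, passages):
    """Assemble a prompt that restricts the model to the retrieved passages."""
    context = "\n".join(f"[{p['id']}] {p['body']}" for p in passages)
    return (
        "Answer using only the runbook excerpts below. "
        "If they do not cover the question, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical curated records, as a certified data product might expose them.
RUNBOOKS = [
    {"id": "rb-1", "body": "Queue backlog alert: scale the worker pool and check consumer lag."},
    {"id": "rb-2", "body": "Disk usage alert: rotate logs and expand the volume."},
    {"id": "rb-3", "body": "Certificate expiry alert: renew via the internal CA."},
]

question = "How do I handle a disk usage alert?"
hits = retrieve(question, RUNBOOKS)
prompt = build_prompt(question, hits)
```

Whatever retriever is used, the shape stays the same: the model only sees what the curated library returns, which is why the quality of that library matters more than the choice of model.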
Discipline, Not Models, Is the Durable Advantage
The models themselves are improving rapidly, and access to them is becoming a commodity. The real, durable advantage isn't having a cutting-edge LLM. It's having the organizational discipline and technical foundation to connect that LLM to your proprietary data in a way that is safe, reliable, and creates value.
That foundation isn't built in a week. It’s the product of a culture that treats data not as application exhaust, but as an engineered product. The hype cycle will move on, but the organizations that thrive will be the ones that stood on solid ground long before the storm arrived.
Concrete Takeaways
- Focus on Data First. The generative AI gold rush is a data engineering challenge in disguise. A robust data foundation is the non-negotiable prerequisite.
- Treat AI as a New Data Consumer. An LLM agent is just another consumer of your data. It should be subject to the same rules of access, governance, and quality as your BI tools.
- A Control Plane Is Essential. Your ability to control the context fed to a model—via RAG on curated data products—is what separates a production system from a dangerous toy.
- Calm Is a Strategy. Resisting the pressure to "just ship something" and insisting on proper data integration is how you build lasting value instead of technical debt. The boring work is the enabling work.