Rebuilding my identity from 'data' to 'data and AI'
An enterprise architect's journey from traditional data architecture to building modern systems where deterministic pipelines and agentic LLMs cooperate.
For years, my professional gravity was data. My world was schemas and pipelines—deterministic systems built for correctness. Then AI arrived, not as a new tool to learn, but as a new kind of physics with different rules. Integrating it was more than a technical challenge; it required rebuilding a professional identity forged over two decades.
The label "data architect" started to feel incomplete. It described a world of knowable, testable logic. But the systems I build now must handle both deterministic facts and probabilistic reasoning.
From Deterministic to Probabilistic
My comfort zone was the structured world of SQL, ACID compliance, and idempotent operations. The goal was to build durable, boring systems that worked flawlessly at 3 AM when a critical business report was due. When large language models entered the mainstream, my first instinct was to fit them into this old map. This was a category error.
Trying to manage a probabilistic system with the mental models for a relational database is a recipe for frustration. The initial failure I saw—in my own thinking and across the industry—was treating an LLM as a drop-in replacement for a deterministic component. We'd expect perfectly structured JSON every time without building the necessary validation and retry logic. We'd expect factual precision without giving the model the tools to query a real database. The old map led us to expect a familiar reliability from a fundamentally new technology.
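The validation and retry logic mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production pattern: `call_model` is a hypothetical callable standing in for whatever LLM client is in use, and the `REQUIRED_FIELDS` schema is invented for the example.

```python
import json
from typing import Callable

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"product_id": str, "quantity": int}


def parse_and_validate(raw: str) -> dict:
    """Parse model output and check it against the expected schema."""
    data = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data


def call_with_retries(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model until its output validates, or give up after max_attempts."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return parse_and_validate(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the next attempt can correct it.
            prompt = f"{prompt}\n\nYour previous reply was invalid ({exc}). Reply with JSON only."
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

The point of the sketch is that the deterministic wrapper, not the model, owns the contract: nothing downstream sees output that has not passed validation.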
The 'Bolt-On' Phase
The next logical step was what I call the "bolt-on" phase. Here, we append AI to the side of existing systems. A classic data pipeline runs, and at the very end, it makes a one-way call to a model. For example, it takes structured product descriptions, pushes them through an embedding model, and loads the vectors into a database for semantic search. It works, and it can deliver value.
But architecturally, it's shallow. The data system and the AI system are strangers. The data pipeline doesn't know what a "good" embedding is, and the AI system can't influence its upstream data sources. This pattern is seductive because it isolates the "weird" AI part from the "safe" deterministic core. It minimizes risk, but it also minimizes potential.
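The one-way shape of the bolt-on pattern might look something like this. Everything here is a stand-in: `toy_embed` is a hypothetical placeholder for a real embedding model (it just counts vowels), and a plain dictionary plays the role of the vector database.

```python
from typing import Callable

Vector = list[float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: one vowel count per dimension."""
    text = text.lower()
    return [float(text.count(ch)) for ch in "aeiou"]


def run_pipeline(
    rows: list[dict],
    embed: Callable[[str], Vector],
    vector_store: dict[str, Vector],
) -> None:
    """Deterministic cleanup first, then a one-way call to the model at the very end."""
    for row in rows:
        description = row["description"].strip()  # classic deterministic transform
        # Fire and forget: nothing about embedding quality flows back upstream.
        vector_store[row["product_id"]] = embed(description)
```

Note that `run_pipeline` returns nothing and the model has no channel back to the data side, which is exactly the shallowness described above.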
A Cooperative Architecture
The identity shift crystallized for me when I stopped designing one-way flows and started designing cooperative systems. In this model, deterministic and agentic components are peers, each playing to its strengths. It reminds me of Andrej Karpathy's point in his "State of GPT" talk, where he frames LLMs as a fast, intuitive "System 1" brain. For reliable outputs, they need to be coupled with a slow, deliberate "System 2", which, in our world, is code and deterministic data APIs.
Consider answering a question like, "How did our top product sales in Europe compare to marketing spend last quarter?" A cooperative system works differently than a simple RAG pipeline:
- An agentic orchestrator parses the user's intent. It identifies the need for two specific, factual pieces of data.
- It calls two separate, deterministic tools. The first is a hardened API that executes a SQL query against the sales warehouse. The second queries the marketing spend database.
- These tools return structured, accurate data—JSON objects, not guesses.
- Only now does the LLM do what it excels at: taking these factual inputs and synthesizing a plain-language summary.
Here, the deterministic pipeline provides the facts. The agentic model provides the reasoning and interface. My role is no longer just building the pipeline; it's designing the entire chassis that lets them cooperate safely.
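The flow above can be sketched end to end in a few dozen lines. This is an illustration under heavy assumptions: the table names, sample rows, and function names are all invented, a single in-memory SQLite database stands in for both the sales warehouse and the marketing database, and `parse_intent` and `synthesize` are hard-coded stubs where a real system would call an LLM.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, region TEXT, quarter TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [
            ("Widget", "Europe", "2024-Q1", 120000.0),
            ("Widget", "Europe", "2024-Q1", 30000.0),
            ("Gadget", "Europe", "2024-Q1", 90000.0),
            ("Widget", "Asia", "2024-Q1", 500000.0),
        ],
    )
    conn.execute("CREATE TABLE marketing_spend (region TEXT, quarter TEXT, spend REAL)")
    conn.execute("INSERT INTO marketing_spend VALUES ('Europe', '2024-Q1', 50000.0)")
    return conn


def top_product_sales(conn: sqlite3.Connection, region: str, quarter: str) -> dict:
    """Deterministic tool 1: a fixed, parameterized SQL query against sales."""
    row = conn.execute(
        "SELECT product, SUM(revenue) AS total FROM sales "
        "WHERE region = ? AND quarter = ? "
        "GROUP BY product ORDER BY total DESC LIMIT 1",
        (region, quarter),
    ).fetchone()
    return {"product": row[0], "revenue": row[1]}


def marketing_spend(conn: sqlite3.Connection, region: str, quarter: str) -> dict:
    """Deterministic tool 2: total marketing spend for the same slice."""
    row = conn.execute(
        "SELECT SUM(spend) FROM marketing_spend WHERE region = ? AND quarter = ?",
        (region, quarter),
    ).fetchone()
    return {"spend": row[0]}


def parse_intent(question: str) -> dict:
    """Stub for the agentic orchestrator; a real one would ask an LLM to extract these."""
    return {"region": "Europe", "quarter": "2024-Q1"}


def gather_facts(question: str, conn: sqlite3.Connection) -> dict:
    """Orchestration step: parse intent, then call both deterministic tools."""
    intent = parse_intent(question)
    return {
        "intent": intent,
        "sales": top_product_sales(conn, **intent),
        "marketing": marketing_spend(conn, **intent),
    }


def synthesize(facts: dict) -> str:
    """Stub for the LLM synthesis step: it only ever sees tool outputs, never guesses numbers."""
    sales, spend, intent = facts["sales"], facts["marketing"]["spend"], facts["intent"]
    return (
        f"{sales['product']} led {intent['region']} sales in {intent['quarter']} "
        f"with {sales['revenue']:,.0f} in revenue against {spend:,.0f} of marketing spend."
    )
```

The design choice worth noticing is that the SQL is fixed and parameterized: the probabilistic component chooses which tool to call and with which arguments, but it never writes the query or produces the figures itself.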
Building for the Seam
This new identity isn't about knowing a few AI libraries. The real craft is in understanding the trade-offs at the seam between these two worlds. It's having the production experience to ask the right questions about any new feature:
- Which part demands absolute, deterministic correctness? Build that with boring code and a SQL database.
- Which part requires flexibility or semantic understanding? That's a job for an LLM-powered agent.
- How will we handle failure when the probabilistic part returns a low-confidence answer? Do we fall back to a deterministic path or ask for clarification?
- What are the cost curves? A complex SQL query has a different cost profile than a call to a powerful proprietary model. Which is appropriate here?
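The failure-handling question in the list above can be made concrete with a small sketch. It assumes something models do not provide for free: a usable confidence score. Here `ModelAnswer.confidence` is a hypothetical value in [0, 1] that the caller's own model wrapper would have to supply (for example from a verifier or a self-check step), and the threshold of 0.7 is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CLARIFY_MESSAGE = "I'm not sure I understood. Could you rephrase or add detail?"


@dataclass
class ModelAnswer:
    text: str
    confidence: float  # assumed score in [0, 1]; how it is produced is up to the caller


def answer_with_fallback(
    question: str,
    ask_model: Callable[[str], ModelAnswer],
    deterministic_lookup: Callable[[str], Optional[str]],
    threshold: float = 0.7,
) -> str:
    """Prefer the model when it is confident, else a deterministic path, else clarification."""
    result = ask_model(question)
    if result.confidence >= threshold:
        return result.text
    fallback = deterministic_lookup(question)  # e.g. a canned report or exact-match query
    if fallback is not None:
        return fallback
    return CLARIFY_MESSAGE
```

Making the order of fallbacks explicit in code forces the design conversation early: someone has to decide what the threshold is and what the deterministic path actually returns.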
My background in data engineering isn't a legacy to overcome. It is the foundation for building reliable AI systems because I understand the unglamorous truths of data quality, latency, and cost that slick demos ignore. For any data professional feeling this shift, the path forward is to build. Start with a small, integrated system that forces you to confront these architectural seams. That is where the real learning happens.