Why I decided to rebuild my skills from the foundation up
AI
An enterprise architect's personal journey of rebuilding skills for the AI era. Learn why foundational knowledge in math and systems design is key to building durable systems.
The system was simple on paper. A retrieval-augmented generation pipeline, based on the approach first detailed in a 2020 paper from Facebook AI Research, that took a customer query, fetched documents, and produced an answer. It worked beautifully in demos. In production, it failed in the most peculiar ways—subtly drifting in accuracy, occasionally hallucinating sources with unnerving confidence. The logs showed no errors. The API calls returned 200 OK. Yet, the system was fundamentally unreliable.
That experience was the trigger. For years, I’d built my career on mastering layers of abstraction. But here, the abstraction was a mirage. To fix the problem, I couldn’t just tweak a prompt; I had to understand why certain query vectors were landing in the wrong region of a high-dimensional space. I had to go deeper.

The Leaky Abstraction of AI
For most of my career, progress meant building on top of robust, opaque layers. You didn't need to know TCP internals to build a web app. You didn't need to understand a database's B-tree implementation to write a query. These abstractions were stable. They worked.
AI, particularly large language models, is different. Calling an LLM is not like calling a sorting function. The contract is probabilistic, not deterministic. The "API" of prompt -> completion seems simple, but it hides immense complexity whose failure modes bubble up in unexpected ways. My RAG system wasn't failing because of a bug in the code; it was failing because of the conceptual impedance mismatch between the statistical nature of vector search and the logical precision the application required. The abstraction had sprung a leak.

From API Calls to Vector Math
The first casualty of my old way of thinking was the comfortable distance I’d kept from "the math." As a software architect, you could get very far by focusing on system interfaces and infrastructure. But with AI, the math isn’t just an implementation detail—it is the architecture.
I committed to relearning linear algebra. Not to write new algorithms from scratch, but to build a strong intuition. That specific RAG failure, the one that kicked this all off, was a problem of vector geometry. To debug it, I had to reason about embedding model biases and the true meaning of cosine similarity in our specific domain. Without that foundation, I was just guessing. You can’t design a durable RAG system without it.
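To make "the true meaning of cosine similarity" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hypothetical stand-ins for embeddings (real ones typically have hundreds of dimensions), and the function is an illustration of the formula, not the code from the system described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", chosen only for illustration.
query = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # same direction, twice the magnitude
doc_orthogonal = [0.0, 1.0, 0.0]      # shares no component with the query

sim_same = cosine_similarity(query, doc_same_direction)
sim_orth = cosine_similarity(query, doc_orthogonal)
```

Because the measure ignores magnitude, `doc_same_direction` scores as a near-perfect match even though it is a different vector. A high score says two vectors point the same way in the embedding space; whether that direction corresponds to relevance in a given domain depends on the embedding model.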
This journey wasn't about becoming a machine learning researcher. It was about becoming an architect who could stand on solid ground. It meant that instead of just swapping one vector database for another based on a marketing page, I could reason about their indexing strategies and how they’d perform under our specific data distribution. The foundation provides the "why" behind the "what."

Fusing Deterministic and Agentic Systems
My entire background is in building deterministic systems. You input X, you get Y, every time. An LLM-powered agent is the antithesis of this. Many early attempts at agentic work, like the first versions of Auto-GPT, tried to create autonomy by simply chaining LLM calls together. This creates a cascade of probabilities that is fundamentally unsuited for production.
For the sake of argument, imagine an agentic chain where each of five steps has a 95% chance of success. If the steps fail independently, the probability of the entire sequence completing correctly is 0.95^5, or just over 77%. That’s not a reliable system.
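The arithmetic is easy to check, and to extend to longer chains. This small sketch assumes every step succeeds with the same probability and that failures are independent, which is a simplification; the function name is made up for this example.

```python
def chain_success_probability(step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return step_success ** steps


# Print the end-to-end success rate for chains of increasing length.
for steps in (1, 5, 10, 20):
    p = chain_success_probability(0.95, steps)
    print(f"{steps:>2} steps at 95% each: {p:.1%}")
```

At five steps the result is about 77.4%, and at ten steps it falls below 60%, so reliability erodes quickly as a chain grows.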
The more robust pattern is a hybrid. It treats the LLM as a powerful but untrustworthy component inside a classic, reliable structure. It’s an idea with parallels to what practitioners such as Andrej Karpathy have discussed under labels like "LLM OS": using traditional computing to safely harness the model's capabilities. A deterministic state machine defines the valid stages of a task. Within a specific state, an LLM agent is invoked to perform a probabilistic step, like "summarize these documents." Its output is then pushed through a deterministic validation layer. If the output fails validation, the state machine can retry the step or route the task for human review. The agent is contained.
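The hybrid pattern described above can be sketched in a few lines. Everything here is hypothetical and chosen for illustration: the state names, the JSON output contract, the retry limit, and `flaky_llm`, which stands in for a real model call. It is one possible shape for the idea, not a reference implementation.

```python
import json
from enum import Enum, auto
from typing import Callable, Optional, Tuple


class State(Enum):
    SUMMARIZE = auto()
    DONE = auto()
    HUMAN_REVIEW = auto()


def validate_summary(raw: str) -> bool:
    """Deterministic check: output must be JSON with a non-empty 'summary' string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    summary = data.get("summary")
    return isinstance(summary, str) and bool(summary.strip())


def run_task(
    llm_step: Callable[[str], str], documents: str, max_attempts: int = 3
) -> Tuple[State, Optional[str]]:
    """Run the probabilistic step inside a deterministic loop with bounded retries."""
    state = State.SUMMARIZE
    attempts = 0
    output = None
    while state is State.SUMMARIZE:
        attempts += 1
        candidate = llm_step(documents)   # the only non-deterministic call
        if validate_summary(candidate):
            output = candidate
            state = State.DONE
        elif attempts >= max_attempts:
            state = State.HUMAN_REVIEW    # give up and escalate
    return state, output


def flaky_llm(_documents: str) -> str:
    """Stand-in for a model call: fails validation once, then returns valid JSON."""
    flaky_llm.calls += 1
    if flaky_llm.calls == 1:
        return "Sure! Here is a summary..."  # not JSON, so validation rejects it
    return json.dumps({"summary": "Two documents describe the refund policy."})


flaky_llm.calls = 0
final_state, result = run_task(flaky_llm, "doc A\ndoc B")
```

The model never decides what happens next. It only produces a candidate, and the surrounding code decides whether to accept it, retry, or hand the task to a person.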

The Durable Over the Trendy
This period of rebuilding has cemented a core belief: durable knowledge is worth more than trendy knowledge. The AI space is awash in new frameworks that promise to simplify everything. Many are useful, but they are also ephemeral.
The time I spent internalizing distributed systems concepts, such as those in foundational texts like Martin Kleppmann's Designing Data-Intensive Applications, was more valuable than learning last year's hot agent framework. The principles of idempotency and fault tolerance allow me to design a system that can safely manage an LLM. Relying on a framework's black-box implementation of "agentic loops" does not. The former outlasts any specific tool; the latter breaks when the tool's abstraction leaks.
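As one illustration of why idempotency matters once retries enter the picture, here is a minimal sketch of an idempotent wrapper around a side-effecting step. The class, the `send_refund` action, and the request fields are all hypothetical, and the in-memory dictionary stands in for what would need to be a durable store in a real system.

```python
import hashlib
import json


class IdempotentStep:
    """Runs a side-effecting action at most once per distinct request."""

    def __init__(self, action):
        self._action = action
        self._results = {}  # in-memory for the sketch; assume durable storage in practice

    @staticmethod
    def key_for(request: dict) -> str:
        # Canonical JSON so that key order does not change the idempotency key.
        canonical = json.dumps(request, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def run(self, request: dict):
        key = self.key_for(request)
        if key not in self._results:
            self._results[key] = self._action(request)
        return self._results[key]


sent = []  # records every time the side effect actually happens


def send_refund(request: dict) -> str:
    """Hypothetical side effect that must not happen twice."""
    sent.append(request["order_id"])
    return f"refund issued for {request['order_id']}"


step = IdempotentStep(send_refund)
first = step.run({"order_id": "A-100", "amount": 25})
retry = step.run({"amount": 25, "order_id": "A-100"})  # same request, retried
```

If an agent or a retry loop repeats the same request, the action runs once and the stored result is returned, so a confused model cannot issue the same refund twice.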
The convergence of software, data, and AI means the foundational concepts of all three are now prerequisites. The most effective architects are the ones who can cross these domains, anchored by a deep understanding of the principles that don't change.

What to Remember
For anyone building in this new landscape, my experience suggests focusing on the foundation first. Here are the key takeaways:
- Treat LLMs as Leaky Abstractions: The simple prompt -> completion API hides immense complexity. You must understand the mechanics underneath (vector math, data quality, probability) to debug real-world failures.
- Wrap Agents in Deterministic Scaffolding: Don't chain probabilistic calls. Use state machines and validation layers to contain agentic steps, ensuring your system remains auditable and reliable.
- Invest in First Principles: The specific frameworks and models will change. A deep understanding of linear algebra, distributed systems, and data structures is a more durable investment for your career.
- Solve for the Boring Parts: Production reliability isn't about novel agent designs. It's about idempotency, observability, and cost management. The architecture must solve for these first.