When to use an LLM, and when to just write the query

I was in an architecture review last week when a familiar pattern emerged. A sharp engineer proposed a new natural language interface for a critical dashboard. Their diagram showed user text flowing directly to an LLM, which would generate and execute SQL on the fly. The team was excited. I, however, felt that cold dread you get when you’ve seen a specific type of system fail spectacularly at 3 AM.

The temptation to point an LLM at a database and let it handle the queries is immense. It feels like magic. But after building data systems for over two decades, I’ve learned the most painful failures come from using a tool against its fundamental nature.

When to use an LLM, and when to just write the query

The Fork in the Road: Intent vs. Instruction

The core of the problem is a confusion between interpreting intent and executing an instruction. They are not the same job. LLMs are intent engines. They excel at taking a fuzzy, human-flavored input like, "How did our big sellers do in the northeast last quarter?" and guessing what the user probably means.

A SQL query, on the other hand, is an instruction engine. It does exactly what it is told, with perfect precision and repeatability. Its determinism is a feature, not a limitation. When you ask an LLM to write that SQL for you, you're inserting a probabilistic guess into a process that demands certainty. The business user needs to see the exact same number every single time, and an LLM offers no such guarantee.

Intent vs. Instruction Flow

Where LLM-to-SQL Breaks in Production

The demo of a text-to-SQL system always works. The production reality is a cascade of failure modes. I’ve seen this pattern lead to a hallucinated JOIN that brought a production data warehouse to its knees during quarter-end reporting. That's the kind of 3 AM page nobody wants.

Now, some will argue that advanced techniques can mitigate this. They’ll talk about fine-tuning on schema, few-shot examples in the prompt, or validation layers that check the generated SQL. These are clever and can improve accuracy in a lab. But for mission-critical reporting, they don't solve the core problems. You're still left with a system that is non-deterministic by nature, making it a nightmare to debug when a user reports that "the number changed." Trust in the data erodes instantly.

Then there's the performance and cost. Where a hand-tuned query returns in 50ms for a fraction of a cent, I've seen the LLM-to-SQL approach take 3-5 seconds and cost several cents per query. That's a non-starter for a user-facing dashboard. Finally, you have an auditability black hole. Auditing *why* a specific query was generated becomes nearly impossible. You’re left trying to debug a ghost.

A Better Pattern: The LLM as a Router

This doesn't mean LLMs have no place here. We're just giving them the wrong job. The most robust and durable architecture uses the LLM not as a query-writer, but as a highly effective dispatcher. This isn't a novel idea; it's a practical application of a powerful pattern known in the industry as tool use or function calling.

The academic groundwork for this approach was well-articulated in papers like the ReAct paper by Yao et al., which showed how models could synergize reasoning and acting. In production today, this is implemented via tools like OpenAI's documentation on function calling. The pattern is simple: the LLM's only job is to analyze the user's text and choose the right tool from a predefined list. Its output isn't SQL, but a structured call like { "function": "getTopProducts", "parameters": { "region": "NE" } }. That JSON then triggers a hardened, tested function in your backend which executes the efficient, safe SQL your team wrote and maintains. You get a fluid interface with a rock-solid, deterministic core.

When the LLM Is the Right Retrieval Engine

Of course, there are times when using an LLM as the core retrieval engine is the correct choice. The key difference is the nature of the underlying data. If you're working with unstructured data—like summarizing themes from thousands of customer support emails or finding relevant clauses in legal documents—an LLM is the only tool for the job. You can't run SQL against a pile of prose.

Another valid use is for low-stakes, internal exploratory tools for data scientists, where occasional non-determinism can spark new avenues of investigation. And finally, LLMs are fantastic for post-processing. After your deterministic query fetches the data, you can pass the structured results to an LLM to generate a human-readable summary. It isn't doing the math; it's telling the story of the math your reliable query provided.

The Composed Architecture in Practice

The goal is a system that composes these different tools, respecting the grain of each one. The LLM handles the fuzzy, linguistic front-end, while deterministic code and databases handle the precise, logical back-end. This separation of concerns creates a system that is both intelligent and reliable.

Hybrid AI and Data Architecture

The Takeaway: Respect the Tool's Grain

Every technology has a grain—a natural way it works best. Forcing an LLM to act like a relational database is working against its grain, creating systems that are brittle, expensive, and opaque. The durable architecture respects these boundaries. Use LLMs for what they are uniquely good at: understanding the ambiguity of human language. Use databases and deterministic code for what they are good at: providing precise, reliable, and auditable facts. The real craft of modern architecture is not about replacing the old with the new, but about composing them intelligently so each can do its best work.