SQL patterns I still trust after hundreds of databases

The call came in around 2 AM. A critical billing job had failed, double-charging thousands of accounts before crashing. The root cause was a monster of a stored procedure, thousands of lines of nested subqueries written by someone long gone. It was a masterpiece of cleverness that had become an unmaintainable trap.

That experience burned a lesson into my mind: durability is a function of simplicity. This is more urgent than ever now, as we build systems where deterministic data pipelines must reliably cooperate with probabilistic LLM agents. Fashions in databases change, but a few core SQL patterns have only grown more relevant. They are the bedrock for the new stack.

The Foundation of Reliable Data Prep

CTEs: Legibility for Humans and Agents

The biggest leap in SQL readability was the Common Table Expression (CTE). Using the WITH clause, you build a query step-by-step, giving each logical unit a name, like a series of small helper functions. It’s the difference between a monolithic script and well-structured code.

This is not just a human convenience; it's a requirement for modern AI systems. Whether you're preparing a dataset for RAG or building a SQL-speaking tool for an LLM agent, the logic must be auditable. A tangled nest of subqueries is impossible to debug. A clean sequence of CTEs tells a clear story about how the data was transformed.

But they aren't a silver bullet. Some database engines, particularly older versions, can treat CTEs as an "optimization fence," refusing to push predicates past them and leading to dreadful query plans. In these rare cases, or when you need to repeatedly reference a complex intermediate result, a temporary table might still be the more pragmatic choice. The goal is clarity and performance, not dogmatic adherence.

Window Functions: The Engine of Feature Engineering

Before window functions, trying to compare a row to others in its group—like finding a customer's last order—required tortured, slow self-joins. Window functions like LAG(), ROW_NUMBER(), and SUM() OVER (...) let you perform these calculations cleanly and efficiently, without collapsing the rows.

Today, this pattern is the workhorse of feature engineering. Need to calculate a user's average session time over their last five visits? Need to rank documents by relevance within a specific category for a search result? These are the exact kinds of contextual features that make AI models powerful, and window functions are how you generate them efficiently, right inside the database.

The trade-off is memory. A PARTITION BY clause on a high-cardinality key can consume significant resources as the database holds the entire partition in memory to perform the calculation. For massive-scale analytics, pre-calculating these features in a batch data pipeline can be a more durable pattern than running them on-the-fly against a live database.

Idempotency: A Non-Negotiable for Automation

In any system where software components talk over a network, you have to assume requests will fail and be retried. This is especially true for LLM agents, which might retry a tool-call after an unrelated fault. If an agent calls your API to create a user and retries, you can’t create two users.

The solution is an idempotency key, a unique identifier the client generates for an operation. Your API endpoint tries to insert a record with this key, which has a UNIQUE constraint. If it fails because the key already exists, your code catches the specific error, looks up the original record, and returns a success response. The operation happens exactly once. For a masterclass in this, look at the Stripe API documentation on idempotent requests; they set the standard for building reliable, distributed systems.

This isn't just for APIs. Any deterministic automation, from webhook handlers to data ingestion pipelines, needs this to be truly reliable. The only operational catch is you may eventually need a process to clean out old idempotency keys from your database.

Core Entities and Event Streams: The Ultimate Source of Truth

The most resilient schema I use separates two concepts: stable "core" entities and append-only "event" streams.

Core Entities: A customers table with an ID and name. It holds the current state.
Event Streams: An immutable log of everything that happens. order_placed, payment_processed, email_sent. You never UPDATE an event. A refund is a new refund_processed event.

This separation provides a complete, auditable history of your system. This event log is the perfect source of truth for training and fine-tuning AI models. You can replay history to reconstruct state at any point in time, debug agent behavior, and generate pristine training data that reflects the true sequence of actions.

However, this pattern, often called Event Sourcing, introduces real complexity. You now need to build "projectors" that read the event log to create the current-state views your application needs. As the canonical writer on software patterns, Martin Fowler's work on Event Sourcing is the best place to understand the deep trade-offs. It's an incredibly powerful pattern, but it isn't free.

Architecture for Hybrid AI Systems

Concrete Takeaways

These patterns are the ligaments of a durable architecture. They connect the raw data to the intelligent systems we build today.

Audit Your Queries: Find a complex query in your system. Refactor it with CTEs and ask if its purpose is clearer. This clarity is essential for debugging the data pipelines that feed your AI.
Look for Feature Gaps: Identify where your application could benefit from richer context. Can you use a window function to calculate a time-series feature that would make a model smarter?
Enforce Idempotency: The next time you build an API endpoint or a background job that writes data, add an idempotency key from day one. Treat it as a non-negotiable part of the contract for any automated process.
Separate State from History: Look at your schema. Can you distinguish the stable "nouns" from the immutable "verbs"? Drawing that line is the first step toward a more auditable system and a higher-quality source of data for AI.