The data-pipeline playbook I rebuild for every new project
A veteran architect's playbook for building data pipelines that last. Learn why declarative, idempotent principles are the essential foundation for modern AI systems.
The first time a failed midnight job corrupted a production table because a script wasn't idempotent was the last time I started a project without a playbook. It’s a lesson paid for in stress and emergency patches. Every new project still whispers the same temptation: just a quick script, we’ll formalize it later. I’ve learned "later" is where systems go to die.
So now I start every project by implementing the same core patterns. It feels deliberately slow for the first week. But it produces a system that is predictable and testable and, most importantly in the age of AI, provides the stable foundation required to manage inherently unpredictable components.
Declare the What, Not the How
The fundamental move is to stop writing pipelines as a sequence of imperative scripts. Instead, I define every unit of work as a declarative configuration, usually in a simple format like YAML. The code becomes a generic, reusable engine that reads this config and executes it. This isn't a new idea; it's a core tenet of the philosophy Maxime Beauchemin articulated in his writing on Functional Data Engineering.
Instead of a script with hardcoded endpoints, the work is described in a config file. The engine that runs it is dumb on purpose. It knows how to perform a registered set of tasks—fetch from an API, flatten a structure, write to an object store—but all the business-specific logic lives in the configuration. The system now scales by adding new, version-controlled config files, not by copying and modifying fragile code.
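Here is a minimal sketch of that split between a dumb engine and a config. Everything named here (`TASKS`, `run_pipeline`, the `flatten` and `select` task types, the `orders_daily` pipeline) is hypothetical and chosen for illustration. The config is written as a Python dict standing in for what a parsed YAML file would contain.

```python
from typing import Any, Callable

# Registry of generic task types the engine knows how to run (hypothetical names).
TASKS: dict[str, Callable[..., Any]] = {}


def task(name: str):
    """Register a function under a task name that configs can refer to."""
    def register(fn):
        TASKS[name] = fn
        return fn
    return register


@task("flatten")
def flatten(data: dict, sep: str = ".") -> dict:
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat: dict = {}

    def walk(node: dict, prefix: str) -> None:
        for key, value in node.items():
            path = f"{prefix}{sep}{key}" if prefix else key
            if isinstance(value, dict):
                walk(value, path)
            else:
                flat[path] = value

    walk(data, "")
    return flat


@task("select")
def select(data: dict, fields: list[str]) -> dict:
    """Keep only the listed fields that are present."""
    return {f: data[f] for f in fields if f in data}


def run_pipeline(config: dict, data: Any) -> Any:
    """The engine: look up each configured step and apply it in order."""
    for step in config["steps"]:
        fn = TASKS[step["task"]]
        data = fn(data, **step.get("params", {}))
    return data


# Stand-in for a parsed, version-controlled YAML file.
PIPELINE_CONFIG = {
    "name": "orders_daily",
    "steps": [
        {"task": "flatten", "params": {"sep": "_"}},
        {"task": "select", "params": {"fields": ["id", "customer_name"]}},
    ],
}

record = {"id": 7, "customer": {"name": "Ada", "tier": "gold"}}
result = run_pipeline(PIPELINE_CONFIG, record)
print(result)
```

Adding a second pipeline in this sketch means writing another config, not another script. The engine only changes when a genuinely new task type is needed.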
Idempotency is Non-Negotiable
If you only take one idea from this, make it this one: every task must be idempotent. You can run it once or one hundred times with the same inputs and get the exact same result. This is the bedrock of reliability. Failure is a given; networks will glitch, services will fail. An idempotent design makes a retry a safe, automatic operation, not a frantic manual intervention.
I enforce this with two main patterns:
- Atomic Writes. Never modify data in place. Write all output to a temporary location. Only after the write succeeds do you perform an atomic "move" or metadata swap to make it the live version. This prevents partial runs from corrupting a production dataset.
- Deterministic Partitioning. Key outputs by their inputs. For time-series data, this is usually the date of the data being processed (e.g., /path/to/data/YYYY-MM-DD/). A rerun for a specific day safely overwrites just that day’s data.
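The two patterns can be sketched together on a local filesystem. This is an illustration under stated assumptions, not a definitive implementation: `partition_path`, `write_partition`, and the `data.json` file name are hypothetical, and JSON stands in for whatever format the pipeline really writes. On a POSIX filesystem `os.replace` gives the atomic swap. Object stores generally lack a filesystem-style rename, so there the equivalent step is usually a metadata or manifest swap.

```python
import json
import os
import tempfile
from datetime import date
from pathlib import Path


def partition_path(root: Path, day: date) -> Path:
    """Deterministic location: the same day always maps to the same path."""
    return root / day.isoformat() / "data.json"


def write_partition(root: Path, day: date, rows: list[dict]) -> Path:
    """Write one day's output to a temp file, then atomically swap it in."""
    target = partition_path(root, day)
    target.parent.mkdir(parents=True, exist_ok=True)

    # The temp file lives in the target directory so the final rename
    # stays on one filesystem and remains atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(rows, tmp, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, target)
    except BaseException:
        # A failed run leaves no temp debris and does not touch the live file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Calling `write_partition(root, date(2024, 1, 2), rows)` twice with the same rows leaves exactly one `data.json` under `2024-01-02/` with identical bytes, which is what makes a retry safe.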
This discipline feels like a tax on day one. It pays for itself the first time you can confidently rerun a week of jobs and know you won't make the problem worse.
The Stable Foundation for Unstable Agents
This deterministic playbook is more critical now than ever. The modern stack isn't just about transforming structured data anymore; it's about feeding and managing stochastic, non-deterministic LLM agents. Trying to build a reliable AI system on a foundation of brittle, one-off scripts is a recipe for disaster.
A reliable, idempotent data pipeline is the prerequisite for sanity in MLOps and LLMOps. How do you consistently prepare and chunk documents for a RAG pipeline? With a version-controlled, declarative pipeline. How do you ensure the data feeding your vector database is fresh and correct? By running idempotent jobs that can be safely retried. How do you trace an agent’s hallucination back to its source data? By having data lineage you can actually follow, which is a natural outcome of this architecture because every output is keyed by its inputs and produced from a versioned config.
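As one illustration of what this can look like for RAG preparation, here is a sketch of deterministic chunking where each chunk carries enough metadata to trace it back to its source. The `Chunk` fields, the `chunk_document` function, the fixed-size character windows, and the default sizes are all assumptions made for the example. Real pipelines often split on tokens or document structure instead.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str       # stable ID derived from the source and the chunking params
    source_uri: str     # where the document came from
    source_sha256: str  # hash of the exact source text that was chunked
    start: int          # character offsets into the source text
    end: int
    text: str


def chunk_document(source_uri: str, text: str,
                   size: int = 200, overlap: int = 20) -> list[Chunk]:
    """Split text into fixed-size overlapping windows with deterministic IDs."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    source_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    step = size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        # Same source and same parameters always yield the same ID, so a
        # rerun overwrites existing vector-store entries instead of duplicating them.
        key = f"{source_uri}:{source_hash}:{start}:{end}:{size}:{overlap}"
        chunk_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
        chunks.append(Chunk(chunk_id, source_uri, source_hash,
                            start, end, text[start:end]))
        if end == len(text):
            break
    return chunks
```

Because the ID and offsets are stored with each chunk, a retrieved passage that led to a bad answer can be mapped back to an exact span of an exact version of the source document.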
The "boring" pipeline becomes the bedrock of reliability that allows you to experiment safely with the chaotic, creative power of agents at the layer above.
The Trade-Off: Portability vs. Velocity
This playbook deliberately avoids vendor lock-in. The central data store is treated as commodity object storage (like S3 or GCS) holding open formats like Apache Parquet. The orchestrator (like Airflow or Dagster) only manages the dependency graph, triggering isolated containerized tasks. None of the core logic lives inside a proprietary platform feature.
This prioritizes long-term portability over initial velocity. It's an explicit trade-off. For a team that needs to ship a product this quarter, choosing a tightly integrated, managed ecosystem like dbt Cloud with Snowflake can be a perfectly valid engineering decision. They are trading the freedom to swap components later for the speed of using a cohesive, opinionated platform now. My playbook is for projects where durability and the ability to evolve the stack over five or ten years are the primary concerns.
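The narrow role given to the orchestrator can be sketched with nothing more than Python's standard-library `graphlib`. This is not how Airflow or Dagster work internally. It is a toy under stated assumptions: `run_dag` is a hypothetical name, and the plain callables stand in for what would be isolated containerized tasks in a real deployment.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_dag(dependencies: dict[str, set[str]],
            tasks: dict[str, Callable[[], None]]) -> list[str]:
    """Run tasks in dependency order; the orchestrator holds no business logic."""
    # Each key maps to the set of tasks that must finish before it starts.
    order = list(TopologicalSorter(dependencies).static_order())
    executed: list[str] = []
    for name in order:
        tasks[name]()  # in practice: launch a container and wait for it
        executed.append(name)
    return executed


log: list[str] = []
tasks = {
    "extract": lambda: log.append("extract ran"),
    "transform": lambda: log.append("transform ran"),
    "load": lambda: log.append("load ran"),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}

executed = run_dag(dependencies, tasks)
print(executed)
```

Nothing in `run_dag` knows what the tasks do, so swapping the orchestrator later means reimplementing only this thin layer.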
The Playbook Distilled
This isn't about a specific technology. It’s an architecture that defends against complexity and entropy over time. When I start a new project, these are the guardrails I put up on day one:
- Define all work in declarative, version-controlled config files. Code is the engine; config is the instruction.
- Enforce idempotency as a hard requirement for every task. If I can't safely run it twice, it's considered broken.
- Use a commodity object store with open formats as the single source of truth. The data should outlive the tools.
- Keep the orchestrator dumb. Its only job is to manage the dependency graph, trigger tasks, and handle alerts.
This approach doesn’t produce the cleverest demo in the first week. It produces a system that still works, and that you can still reason about, on day one thousand.