My first real data pipeline, held together with hope and cron
Data
A 25-year architect's story of a first data pipeline built on cron and hope. It covers the silent failures that followed and the hard-won, non-negotiable lessons.
The green cursor blinked in the terminal, a steady pulse at 2 AM. A shell script, wrapping a few hundred lines of Python, had just finished its first run. It pulled data from a partner API, tidied it into a CSV, and loaded it into our production database. I added the execution line to the server's crontab, set it to run hourly, and leaned back. It felt like engineering: simple, effective, automated. It was also a time bomb.
This is the story of that first real pipeline—the kind that graduates from a notebook into something a business depends on. It’s about the deceptive simplicity of early automation and the foundational patterns the demos so often leave out.
The Architecture of Hope
The business need was simple: get daily sales data from a third-party logistics provider into our system. The provider had a basic REST API, making a quick script seem like the perfect tool. The stack was a masterpiece of pragmatism, or so I told myself. A Python script used requests to fetch data and pandas to flatten the JSON. A second step in the same shell script called psql to bulk-insert the data from a temp file.
The entire operation was kicked off by a single line in crontab -e:
0 * * * * /home/juan/bin/run_hourly_sync.sh >> /var/log/sync.log 2>&1
This was deterministic automation, I thought. No complex orchestrator, no vendor platform. Just pure Linux tooling. What could possibly go wrong?
Where Simplicity Becomes a Liability
For a few weeks, the script hummed along. But cron doesn't care if your logic is sound; it only cares about running the command on schedule. The silent failures began.
First, the upstream API added a new, optional field. My script, expecting a fixed structure, didn't break—it just silently dropped the new data. A week later, a manager asked why a new region wasn't showing up in the dashboard. The pipeline was "working," but it was producing the wrong answer, and nobody knew.
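One cheap guard against this kind of silent drift is to compare each payload's keys with the fields the script knows how to map, and to make any difference loud. What follows is a minimal sketch rather than the original script: the field names in EXPECTED_FIELDS and both function names are hypothetical, since the partner API's real schema isn't shown in this post.

```python
import logging

logger = logging.getLogger("sync")

# Hypothetical field names; the partner API's actual schema is an assumption.
EXPECTED_FIELDS = {"order_id", "region", "amount"}


def find_schema_drift(records, expected=EXPECTED_FIELDS):
    """Return (unexpected, missing) sets of field names seen across all records."""
    unexpected, missing = set(), set()
    for record in records:
        keys = set(record)
        unexpected |= keys - expected
        missing |= expected - keys
    return unexpected, missing


def validate_or_fail(records):
    """Fail on missing fields and warn on new ones, instead of dropping data quietly."""
    unexpected, missing = find_schema_drift(records)
    if missing:
        raise ValueError(f"payload is missing expected fields: {sorted(missing)}")
    if unexpected:
        # A new optional field is not fatal, but someone has to see it.
        logger.warning("payload has unmapped fields: %s", sorted(unexpected))
    return records
```

Whether a new field should warn or fail outright is a judgment call. The point is only that the change shows up somewhere a human will see it before a manager does.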
The second failure was catastrophic. A holiday sales spike caused the API call to take longer than sixty minutes. The next cron job kicked off while the first was still running. Both instances tried to write to the same temporary file, corrupting it. They then fought for a database lock, leading to a cascade of failures. The pipeline ground to a halt, piling up hung processes until the server's load average spiked and took other services with it.
The Non-Negotiable Principles
Cleaning up that 3 AM mess was a clarifying experience. It taught me the difference between a script that runs and a system that is reliable. The principles I learned then are now core to how I design systems.
- Idempotency is about fault tolerance. A pipeline must be safely re-runnable, but it's deeper than that. Idempotency is what makes at-least-once processing safe: a retried or duplicated run leaves the same end state as a single successful one. My "overwrite" approach was crudely idempotent but not atomic, so a run that died partway could leave production half-written. A proper solution uses transactions: load into a staging table, validate, then swap into production in one atomic operation. A failed run leaves the production data untouched.
- State must be managed explicitly. Relying on cron's stateless "fire and forget" nature was the critical flaw. A simple file-based lock, like the one provided by the standard Linux flock(1) utility, would have prevented the concurrent runs. You cannot have two instances of a stateful process running without explicit coordination.
- Observability is more than a log file. Redirecting stdout is not observability. I needed metrics. Logging tells you what happened; metrics tell you the rate and duration of what happened. A simple heartbeat or a counter for processed rows, pushed to a monitoring endpoint, would have shown the degradation long before it became a crisis.
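The last two principles can be sketched together: a non-blocking file lock so that an overlapping run skips instead of colliding, and a small record of status, row count, and duration for every run. This is an illustration under assumptions, not what we actually deployed. LOCK_PATH, run_guarded, and emit are hypothetical names, the job is assumed to be a callable that returns the number of rows it processed, and printing a JSON line stands in for pushing to a real monitoring endpoint.

```python
import fcntl
import json
import os
import tempfile
import time

# Hypothetical lock location; any path the job's user can write to would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "hourly_sync.lock")


def try_acquire_lock(path=LOCK_PATH):
    """Return an open file holding an exclusive lock, or None if another run holds it."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of queueing behind the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def emit(metrics):
    """Print one JSON line; a stand-in for a push to a monitoring system."""
    print(json.dumps(metrics))


def run_guarded(job, lock_path=LOCK_PATH):
    """Run job() under the lock and report status, rows, and duration for the run."""
    metrics = {"status": "skipped", "rows": 0, "duration_seconds": 0.0}
    lock = try_acquire_lock(lock_path)
    if lock is None:
        emit(metrics)  # a previous run is still in progress
        return metrics
    started = time.monotonic()
    try:
        metrics["rows"] = job()
        metrics["status"] = "ok"
    except Exception:
        metrics["status"] = "failed"
        raise
    finally:
        metrics["duration_seconds"] = time.monotonic() - started
        emit(metrics)
        lock.close()  # closing the file releases the lock
    return metrics
```

A duration that creeps toward the sixty-minute schedule, or a sudden run of "skipped" statuses, is exactly the early warning the log file never gave me. The same non-blocking lock is also available from the crontab line itself by wrapping the command in flock -n with a lock file path.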
The Right Tool for the Job
It’s tempting to frame this as "cron is bad, orchestrators are good," but that misses the architectural trade-off. For non-critical tasks, a simple cron job is often the right choice, perfectly embodying the "You Ain't Gonna Need It" (YAGNI) principle. Prematurely deploying a complex orchestrator for a simple reporting script is its own kind of failure.
The moment a process becomes critical—when the business relies on its correctness and timeliness—the equation changes. The silent failure modes of cron are exactly the problem that modern data orchestrators were built to solve. As Maxime Beauchemin outlined in The Rise of the Data Engineer, the industry needed tools that treated workflows as first-class citizens with explicit state, dependency management, and observability built in.
The wisdom isn't in always choosing the most powerful tool; it's in recognizing the inflection point where the cost of simplicity's hidden risks outweighs the cost of a more robust solution.
From Cron to Craftsmanship
We replaced that script with a system that was still simple, but designed with failure in mind. We used flock to ensure single-instance execution. The monolithic script was broken into discrete, idempotent steps: an "extract" step wrote a timestamped file to a landing zone, and a "load" step picked up that file, loading it into a staging table within a transaction before the final atomic swap. This meant we could re-run a failed load without ever re-hitting the source API.
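The load step's shape can be sketched in a few lines. This is a simplified stand-in, not our production code: it uses Python's built-in sqlite3 module instead of the Postgres database described above so that it runs anywhere, the table and column names are hypothetical, and the "swap" is done by replacing the production table's contents from staging inside one transaction rather than by renaming tables.

```python
import sqlite3


def connect(path=":memory:"):
    """Open a connection in autocommit mode so transactions are controlled explicitly."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_staging "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    return conn


def load_batch(conn, rows):
    """Stage rows, validate them, then replace production, all in one transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute("DELETE FROM sales_staging")
        conn.executemany(
            "INSERT INTO sales_staging (order_id, region, amount) VALUES (?, ?, ?)",
            rows,
        )
        (total,) = conn.execute("SELECT COUNT(*) FROM sales_staging").fetchone()
        (bad,) = conn.execute(
            "SELECT COUNT(*) FROM sales_staging "
            "WHERE region IS NULL OR amount IS NULL OR amount < 0"
        ).fetchone()
        if total == 0 or bad > 0:
            raise ValueError(f"staging failed validation: {total} rows, {bad} bad")
        # The swap: production changes only if everything above succeeded.
        conn.execute("DELETE FROM sales")
        conn.execute(
            "INSERT INTO sales SELECT order_id, region, amount FROM sales_staging"
        )
        conn.execute("COMMIT")
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

Running load_batch twice with the same file's rows leaves the same production table, and a batch that fails validation rolls back without touching it. Those two properties are what made it safe to re-run a failed load.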
That fragile pipeline was an essential lesson. The most durable systems are not built on clever code. They are built on a deep respect for failure, using patterns that handle it gracefully. They trade the deceptive allure of initial simplicity for the quiet, boring reliability that actually works at 3 AM.