jcardena.com Blog The ETL job that ran for six hours and I had to make it twenty minutes
145 posts
EN ES

The ETL job that ran for six hours and I had to make it twenty minutes

Data

An enterprise architect's breakdown of a real-world performance disaster. How a six-hour ETL job was cut to 20 minutes by replacing anti-patterns with durable principles.

The alert wasn't for a catastrophic failure. It was for a silent decay. A key executive dashboard was blank because the data pipeline, meant to finish by 6 a.m., was still chugging along six hours later. This is the kind of slow-burn emergency that tests an architect—not a crash to be rebooted, but a design flaw meeting a scale it can no longer handle.

In a world focused on agentic AI, this kind of problem seems dated. But it's the absolute foundation. An LLM agent can't reason on data that never arrives. My task was to fix this deterministic pipeline and make it boringly, reliably fast.

Python ScriptStartsSELECT * fromSource DBProcess All Rowsin MemoryLoop Row-by-RowINSERT
The Original Six-Hour Monolith

The First Bottleneck: Pulling the Entire Ocean

The original script began with a single, massive query against the production transactional database. It was a classic SELECT * with a few joins, pulling hundreds of gigabytes over the network into the memory of a single worker. This front-loaded the entire job with a multi-hour data transfer and put a long-running, heavy read load on a database serving live traffic.

The process couldn't transform a single row until the last byte arrived. The fix was to stop treating extraction as a monolithic event. I replaced it with an incremental, batched pull. The new script would query for a specific range of records—say, 100,000 rows at a time based on an indexed timestamp—process that batch, and then loop. This broke the memory pressure and turned one opaque transfer into a series of observable, manageable steps.

The Memory Trap of In-Process Transforms

With data flowing in chunks, the next bottleneck was obvious. The script loaded each batch into a Pandas DataFrame to perform complex joins and aggregations. Pandas is a phenomenal tool for analysis and data manipulation that fits in memory. This dataset did not.

The worker was thrashing, spending more time on memory management than on computation. It's a fundamental mismatch of tool and task. The principle here is a durable one: move the computation to the data, not the data to the computation. This isn't just my opinion; it's a core concept of data systems design, brilliantly explained by authors like Martin Kleppmann in Designing Data-Intensive Applications. The target data warehouse was built for this work. The solution was to stage the raw data in the warehouse and replace the Python logic with a single, clean SQL statement. The transformation time dropped from hours to minutes.

Death by a Thousand Inserts

The final disaster was the loading phase. The script iterated through its transformed data, issuing a separate INSERT statement for every single row. For millions of rows, this kills a database with transactional overhead, lock contention, and log writes. It’s the most common and easily fixed ETL anti-pattern.

Databases are optimized for bulk operations. Because the transforms were now happening inside the warehouse, the fix was an atomic INSERT INTO ... SELECT ... statement. If the transforms had remained external, the right pattern would be writing batches to a file and using a native bulk-loading utility, like the well-documented COPY command in PostgreSQL. The effect is the same: one single, highly optimized operation instead of millions of tiny, expensive ones.

The New Assembly Line

The six-hour monolith became a simple, three-stage assembly line. A small script extracts and lands batches of raw data. A SQL runner transforms it inside the warehouse. The "load" is now part of that transform. This separation of concerns is a cornerstone of modern data practice, as the team at dbt Labs often discusses. The new architecture isn't just fast; it's observable, restartable, and durable.

SOURCESTransactional DBEvent StreamsSource FilesSTAGING & PROCESSING LAYER (DATA WAREHOUSE)Raw Data StagingSQL TransformationLogicBulk LoadOperationsOrchestrationSERVING & APPLICATION LAYERAnalyticsDashboardsAPIsAgentic Systems
The Sub-20-Minute Decoupled Pipeline

Concrete Takeaways

This kind of optimization work boils down to a few core principles that have held true across decades of platform shifts.

  • Profile I/O First. Before you optimize a single line of business logic, understand how data is moving. Network and disk are often the silent killers of performance.
  • Push Computation to the Data. Use the tool designed for the job. Don't build a miniature, inefficient database engine in your application code if you have a real one ready to do the work.
  • Always Use Bulk Operations. Never load a dataset by looping and inserting one row at a time. The overhead will destroy your performance and your database.
  • Decompose Monoliths. Break down large, opaque jobs into smaller, observable stages. You gain speed, but more importantly, you gain reliability and the ability to debug failures quickly.

The Boring Foundation for the Exciting Future

The total runtime dropped to under twenty minutes. This isn't just a story about legacy ETL. It's about the non-negotiable bedrock required for the systems we want to build next. That reliable, 20-minute job is what ensures the data warehouse is fresh for the LLM agent that needs to query it. Without this craftsmanship in the deterministic layers, any agentic system you build is operating on a foundation of sand. The boring work makes the magic possible.

JC
Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.