Docker to Kubernetes, explained for a data person
A guide for data engineers and scientists moving from Docker to Kubernetes, explaining the core mental shifts around state, scheduling, and declarative architecture.
My first data pipeline in a Docker container felt like a superpower. All the Python dependencies and system libraries were perfectly packaged. I could run it on my laptop, push it to a cloud VM, and kick it off with a simple docker run. For a time, a cron job on a single, oversized server was all I needed.
This is the world of the single box. It's clean, predictable, and solves the "it works on my machine" problem beautifully. But its simplicity is a trap. It doesn't scale gracefully, and its failure mode is silence—the single VM goes down, and your pipeline is dark until someone notices. The model hits a wall, hard.
From Imperative Commands to Declarative State
It's tempting to view Kubernetes as just "more Docker," but that misses the fundamental shift. You move from giving imperative commands to declaring a desired state. Instead of telling a specific machine how to run a container, you describe the end state you want for the entire system, and hand that definition to the cluster.
You write a manifest, typically in YAML, that says, "I want three replicas of this container running at all times, each with 2 CPU cores and 4 GB of RAM." The Kubernetes control plane takes over. Its scheduler finds nodes with enough capacity and places your containers onto them, wrapped in "Pods," the unit Kubernetes actually schedules. If a node fails, the control plane sees the mismatch between your desired state and reality, and automatically schedules a replacement on a healthy node. The system heals itself.
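To make the idea concrete, here is a minimal sketch in Python. It builds the kind of Deployment manifest described above as a plain dictionary, then runs a toy reconciliation loop against a simulated cluster. The names `ToyCluster`, `reconcile`, `etl-api`, and the image path are hypothetical, and the loop is only an illustration of the desired-state idea, not how the real Kubernetes controllers are implemented.

```python
from dataclasses import dataclass, field

# Desired state: the same fields you would normally write in YAML.
# The name and image are placeholders for illustration.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "etl-api"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "etl-api"}},
        "template": {
            "metadata": {"labels": {"app": "etl-api"}},
            "spec": {
                "containers": [{
                    "name": "etl-api",
                    "image": "registry.example.com/etl-api:1.0",
                    "resources": {"requests": {"cpu": "2", "memory": "4Gi"}},
                }]
            },
        },
    },
}


@dataclass
class ToyCluster:
    """Simulated cluster state: node name -> list of pod names."""
    nodes: dict = field(default_factory=dict)
    next_pod_id: int = 0

    def running_pods(self):
        return [pod for pods in self.nodes.values() for pod in pods]


def reconcile(cluster, desired_replicas):
    """Add or remove pods until the running count equals desired_replicas."""
    if not cluster.nodes:
        return  # no healthy node to schedule onto
    while len(cluster.running_pods()) < desired_replicas:
        # Place the new pod on the least-loaded node.
        target = min(cluster.nodes, key=lambda name: len(cluster.nodes[name]))
        cluster.nodes[target].append(f"pod-{cluster.next_pod_id}")
        cluster.next_pod_id += 1
    while len(cluster.running_pods()) > desired_replicas:
        # Remove a pod from the most-loaded node.
        busiest = max(cluster.nodes, key=lambda name: len(cluster.nodes[name]))
        cluster.nodes[busiest].pop()


desired = deployment["spec"]["replicas"]
cluster = ToyCluster(nodes={"node-a": [], "node-b": [], "node-c": []})
reconcile(cluster, desired)      # initial rollout
del cluster.nodes["node-b"]      # simulate a node failure
reconcile(cluster, desired)      # actual (2) != desired (3), so a pod is replaced
print(cluster.nodes)
```

Notice that nothing in the script says "restart the pod on node-a." You only state the number you want, and the loop closes the gap each time it runs.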
Of course, this resilience isn't free. That simple YAML file hides a steep learning curve and an operational overhead that is far from trivial. You are trading the fragility of one machine for the complexity of a distributed system, and that is a very deliberate engineering choice.
Decoupling Data from the Machine
For a data person, the first question is always: where does my data live? If my ETL pod can be terminated and restarted on a different machine at any moment, what happens to my temporary files and checkpoints?
Kubernetes forces a more durable architecture. The simple host-volume mount is replaced by a powerful abstraction. You use a Persistent Volume (PV), which is a piece of storage provisioned for the cluster (in cloud environments, usually a network-attached disk), and a Persistent Volume Claim (PVC), which is your application's request for that storage. When a pod backed by network-attached storage dies and gets rescheduled to a new node, Kubernetes detaches the underlying disk from the failed node and reattaches it to the new one. From your container's perspective, its /data directory is always there.
This works, but it forces you to understand concepts like storage classes and access modes, which have their own failure modes to debug. The gain in resilience requires a deeper understanding of the storage layer.
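As a rough sketch of what that looks like in practice, the Python below builds a PVC and a pod that mounts it at /data, then prints both as JSON, which `kubectl apply -f` accepts alongside YAML. The helper functions, the resource names, the image path, and the `standard` storage class are all assumptions for illustration; the storage classes available depend on your cluster.

```python
import json


def pvc_manifest(name, size, storage_class, access_mode="ReadWriteOnce"):
    """Build a PersistentVolumeClaim: the application's request for storage."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteOnce: the volume can be mounted read-write by one node.
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def pod_manifest(name, image, claim_name, mount_path="/data"):
    """Build a Pod that mounts the claim at mount_path."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "etl",
                "image": image,
                "volumeMounts": [{"name": "scratch", "mountPath": mount_path}],
            }],
            # The pod refers to the claim by name, never to a specific disk or node.
            "volumes": [{
                "name": "scratch",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }


claim = pvc_manifest("etl-checkpoints", "20Gi", "standard")
pod = pod_manifest("etl-worker", "registry.example.com/etl:1.0", "etl-checkpoints")
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [claim, pod]}, indent=2))
```

The pod only knows the claim's name. Which physical disk satisfies that claim, and which node it is attached to, is the cluster's problem rather than yours.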
The Right Primitives for Data Workloads
Most data work isn't a long-running web service. It's a series of batch jobs: nightly ETL, model training runs, weekly reports. Managing these with a standard web-focused `Deployment` object is clumsy. Kubernetes provides workload primitives designed for this.
A Job runs a pod to completion, retrying on failure up to a configurable limit, which makes it perfect for a one-off task. A CronJob creates Jobs on a recurring schedule, providing a resilient, cloud-native replacement for the classic cron daemon. This is the foundation for running serious data platforms. Tools like the Apache Airflow KubernetesExecutor follow the same run-to-completion model: they dynamically launch each task in its own isolated pod, pulling resources from the cluster as needed.
The Trade-Off: Is It Worth The Cost?
Moving to Kubernetes is not an inevitable upgrade; it's a strategic decision. It's the right move when the cost of your current system's failure becomes higher than the cost of Kubernetes' complexity. For many teams, intermediate solutions like managed services (AWS Batch, Google Cloud Run) are a better fit, offering scalability without the full operational burden.
As the respected engineer Kelsey Hightower has put it, Kubernetes is a platform for building platforms. If your goal is just to run a few dozen batch jobs, you might not need to build a whole platform. If your goal is to provide a reliable, scalable, multi-tenant foundation for your organization's entire data and AI practice, then the investment starts to make sense.
The Mental Shift, Not Just The Tech
The jump from a single Docker host to a Kubernetes cluster is an investment in operational maturity. It's a trade you make with eyes wide open. Before you start writing YAML, internalize the core principles you're buying into.
- From Single Host to Abstract Computer: Stop thinking about individual VMs. The cluster is the computer; its scheduler is the operating system.
- From Imperative to Declarative: You don't run commands; you define a target state and let the system work to achieve it. This is the engine of resilience.
- From Host Volumes to Persistent Claims: Data persistence must be decoupled from specific machines for any stateful data application to survive.
- From Scripts to Workload Primitives: Use the right tool for the job, CronJobs for schedules and Jobs for one-off runs. This brings correctness to your operations.
The complexity is real, but so are the rewards. The moment you stop getting paged at 3am because a VM ran out of disk space is when you'll understand the value of treating infrastructure like a resilient, automated factory instead of a collection of hand-tended pets.