
Standing up a bare-metal Kubernetes cluster in my house


I built a bare-metal Kubernetes cluster to understand the failure modes AI systems inherit from the cloud. A deep dive into MetalLB, BGP, and Longhorn.

An AI agent hangs, its request to an external tool timing out. The cloud console shows the pod is healthy. The code looks fine. Where is the failure? The answer is often in the plumbing—the layers of networking and storage the cloud makes invisible. To build reliable agentic systems, you have to understand the failure modes of the infrastructure they stand on. The cloud hides this. I wanted to feel it firsthand.

So I built a Kubernetes cluster on a stack of bare-metal machines in my house. Not to save money, but to pay a deliberate tax in friction. The goal: to tear away the abstractions and rebuild my own mental model of how distributed systems actually work, from the physical network card up.

[Diagram: Fragile AI Agents | Cloud Abstraction Hides Risk | Build a Physical Model | Uncover Failure Modes]
Rationale: From Abstract Pain to Physical Iron

The Goal: A Tactile Understanding of Failure

This project was an investment in intuition. It’s one thing to read that a managed Kubernetes service handles routing and storage for you; it’s another to spend a weekend forcing a BGP peering session to life. My purpose wasn't to replace a cloud provider, but to deeply understand what they do. This follows the spirit of projects like Kelsey Hightower's Kubernetes The Hard Way, where the point is the journey, not the destination.

When an LLM agent needs to access a vector database for its memory, or a deterministic pipeline needs to write to a durable log, it relies on a chain of services. A failure in any link of that chain—DNS, BGP routing, storage I/O—can cause a silent failure in the application layer. By building the chain myself, I’d be forced to confront every potential point of failure.

Networking: Where Magic Becomes Advertised Routes

The first wall is always networking. In the cloud, requesting a type: LoadBalancer service feels like magic; an external IP address just appears. On bare metal, nothing answers that request, and the service's external IP stays pending. The cluster is an island, and you have to build the bridge to your network yourself.

The tool for this is often MetalLB, which I configured in its BGP mode. The simpler Layer 2 mode would have worked by answering ARP requests for the service IP, but using BGP was the entire point: to understand how large-scale routing works. I established a peering session with my router, so when I create a LoadBalancer service, MetalLB advertises the service's IP via BGP. My router now holds a route for that IP and knows which node, or nodes, to forward the traffic to.
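As an illustration of what a BGP-mode setup involves, here is a minimal sketch that generates the three MetalLB resources such a configuration typically needs: an address pool, a BGP peer, and an advertisement. Every value in it (the ASNs, the router address, the pool range, and the resource names) is a placeholder assumption, not a value from this cluster.

```python
import json

# All values below are illustrative placeholders, not taken from a real cluster.
ROUTER_IP = "192.168.1.1"      # assumed address of the home router
ROUTER_ASN = 64512             # private-use ASN assumed for the router
CLUSTER_ASN = 64513            # private-use ASN assumed for MetalLB
POOL_CIDR = "192.168.10.0/24"  # assumed range reserved for LoadBalancer IPs
NAMESPACE = "metallb-system"


def metallb_bgp_manifests():
    """Return the three MetalLB resources a basic BGP setup needs."""
    pool = {
        "apiVersion": "metallb.io/v1beta1",
        "kind": "IPAddressPool",
        "metadata": {"name": "home-pool", "namespace": NAMESPACE},
        "spec": {"addresses": [POOL_CIDR]},
    }
    peer = {
        "apiVersion": "metallb.io/v1beta2",
        "kind": "BGPPeer",
        "metadata": {"name": "home-router", "namespace": NAMESPACE},
        "spec": {
            "myASN": CLUSTER_ASN,
            "peerASN": ROUTER_ASN,
            "peerAddress": ROUTER_IP,
        },
    }
    advertisement = {
        "apiVersion": "metallb.io/v1beta1",
        "kind": "BGPAdvertisement",
        "metadata": {"name": "home-advert", "namespace": NAMESPACE},
        "spec": {"ipAddressPools": ["home-pool"]},
    }
    return [pool, peer, advertisement]


def as_kubernetes_list(items):
    """Wrap resources in a v1 List so they can be applied as one JSON document."""
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    print(json.dumps(as_kubernetes_list(metallb_bgp_manifests()), indent=2))
```

Because kubectl accepts JSON as well as YAML, the output could be piped into `kubectl apply -f -`. The router side still needs its own matching peer configuration, which depends entirely on the router in use.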

It took hours of debugging asymmetric routes and peering states. But the payoff is permanent. A LoadBalancer is no longer an abstract resource. It is a concrete BGP announcement. Now, when an agent’s tool call hangs, my mind models the network path: does the CNI policy allow egress? Is the BGP session flapping? The abstraction is gone, replaced by a causal chain I can reason about.
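The first questions in that causal chain can be asked with nothing but the standard library. The sketch below is an illustration, not tooling from this cluster: `classify_connection` is a hypothetical helper that resolves a name, attempts a TCP connection, and reports which step failed.

```python
import socket


def classify_connection(host, port, timeout=3.0):
    """Resolve the host, attempt a TCP connect, and report which step failed."""
    try:
        addrinfo = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"  # the name never resolved

    # Only the first resolved address is tried, to keep the sketch short.
    family, socktype, proto, _, sockaddr = addrinfo[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
        except socket.timeout:
            return "connect_timeout"     # no reply at all within the timeout
        except ConnectionRefusedError:
            return "connection_refused"  # host answered, nothing is listening
        except OSError:
            return "network_error"       # e.g. no route to host
    return "connected"


if __name__ == "__main__":
    # Hypothetical target; replace with the tool endpoint under investigation.
    print(classify_connection("example.com", 443))
```

As a rough guide, a timeout often points at packets being dropped silently, which is what a restrictive network policy or a missing route tends to look like, while a refusal means the path works and the problem sits at the destination. Neither result is proof on its own, but each narrows where to look next.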

Storage: Where State Gets Real and Replicated

The second wall is state. Pods are ephemeral; an agent's memory or a database's data must be durable. Cloud providers solve this with managed block storage. On bare metal, you have a pile of local disks.

I chose Longhorn to create a software-defined storage layer. It pools the local disks on my nodes and provides replicated, network-attached volumes via a standard CSI driver. When a stateful workload requests a persistent volume, Longhorn carves one out and maintains replicas on other nodes for redundancy. While a more complex system like Rook/Ceph offers more power, Longhorn provided the right balance of capability and educational clarity for this project.
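To make the request path concrete, here is a hedged sketch of the two objects involved: a StorageClass that points at Longhorn's CSI provisioner and a claim that a workload would use to ask for a volume. The class name, claim name, size, and replica count are illustrative assumptions, not this cluster's settings.

```python
import json

REPLICAS = 3  # assumed replica count, chosen only for illustration


def longhorn_storage_class(name="longhorn-replicated", replicas=REPLICAS):
    """Build a StorageClass backed by the Longhorn CSI provisioner."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "driver.longhorn.io",
        # StorageClass parameters are always strings, even for numbers.
        "parameters": {"numberOfReplicas": str(replicas)},
        "reclaimPolicy": "Delete",
        "volumeBindingMode": "Immediate",
    }


def volume_claim(name, size, storage_class):
    """Build a PersistentVolumeClaim that requests a volume from that class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    sc = longhorn_storage_class()
    # "agent-memory" and "10Gi" are hypothetical example values.
    pvc = volume_claim("agent-memory", "10Gi", sc["metadata"]["name"])
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [sc, pvc]}, indent=2))
```

The workload only ever references the claim. The replica count lives in the StorageClass, which is why the replication cost is easy to forget once the class exists.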

The trade-off became immediately apparent. Every write now travels over the network to replicas on other nodes. This is critical when thinking about AI workloads. An agent retrieving context from a high-throughput vector database is now competing for network bandwidth with the storage system that underpins it. This forces you to design for network saturation, a concern managed services often obscure.
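A back-of-the-envelope calculation shows how quickly this adds up. The sketch below assumes a simplified model in which every write is sent once to each replica and, optionally, one replica sits on the same node as the workload. The throughput and link speed in the example are made-up numbers, not measurements from this cluster.

```python
def replication_traffic_mb_s(write_mb_s, replicas, local_replica=True):
    """Estimate network traffic (MB/s) caused by replicating a write stream.

    Simplified model: each write is sent once to every replica, and a
    replica on the workload's own node costs no network traffic.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    remote_replicas = replicas - 1 if local_replica else replicas
    return write_mb_s * remote_replicas


def link_utilization(traffic_mb_s, link_gbit_s):
    """Return the fraction of a link consumed, using 1 Gbit/s = 125 MB/s."""
    link_mb_s = link_gbit_s * 1000 / 8
    return traffic_mb_s / link_mb_s


if __name__ == "__main__":
    # Hypothetical workload: 50 MB/s of writes, 3 replicas, 1 Gbit/s NIC.
    traffic = replication_traffic_mb_s(50, replicas=3)
    print(f"replication traffic: {traffic} MB/s")
    print(f"share of a 1 Gbit/s link: {link_utilization(traffic, 1):.0%}")
```

Under those assumptions, a modest 50 MB/s write stream already generates 100 MB/s of replication traffic, which is most of a gigabit link before any application traffic is counted.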

Respecting the Control-Plane Tax

A quiet benefit of managed Kubernetes is a control plane you never have to run. The etcd database, API server, and scheduler run on infrastructure you don't see or manage. On bare metal, these are your processes running on your nodes.

I dedicated three nodes to the control plane so that etcd could keep its quorum even if one of them failed. The idle resource consumption was a surprise. Before deploying a single agent, the cluster's own machinery consumed a noticeable slice of CPU and memory. Orchestration is not free. It is a constant, low-level tax you pay for resilience, a cost that is simply baked into the price of a cloud service.
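The reason three is the usual starting point comes down to majority arithmetic. etcd needs more than half of its members available to accept writes, which this small sketch works through. The function names are my own, chosen for illustration.

```python
def quorum(members):
    """Smallest majority of an etcd cluster with the given member count."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members):
    """How many members can fail while a majority is still available."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} members: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Three members tolerate one failure. A fourth adds cost but no extra tolerance, which is why odd member counts are the norm.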

[Diagram: five layers. Inputs: User Prompts, External APIs, Data Streams. Execution & Logic: LLM Agent Pods, Deterministic Pipelines, Vector Search Pods. State & Memory: Replicated Volumes (Longhorn), Message Queues, Object Storage (MinIO). Cluster Services: Service Networking (CNI), External Routing (MetalLB), Control Plane (etcd). Serving: Tool APIs, Web UI, Metrics API.]
Agentic Architecture on the Bare-Metal Stack

The Payoff: An Architecture of Fewer Surprises

The cluster now runs a few internal agentic services. But the hardware isn't the asset. The asset is the mental model. When I architect a system in the cloud, I no longer just see the service icons. I see the probable BGP announcements from the cloud router, the CSI driver calls to the storage fabric, and the resource cost of the hidden control plane.

This deeper intuition isn't academic. When a deployed LLM agent inexplicably hangs, my mind doesn't just debug the application code; it instinctively models the potential for storage I/O contention at the CSI layer or a network policy dropping packets at the CNI level. You learn to anticipate the non-obvious failure modes. The demos never show you the mud, but that’s where durable systems are designed.

  • Build to learn, not to produce. Treat a bare-metal cluster as an educational tool to reveal the layers beneath the cloud. The time investment is the point.
  • Master the network. The biggest gap is networking. Learning how traffic gets from the router to a pod via a tool like MetalLB is the most valuable lesson.
  • Plan for state. Data is the hard part. A distributed storage solution like Longhorn isn't optional for serious work; it's a core primitive you have to build.
  • Model the whole system. When building agentic or data-intensive systems, the application is only half the picture. The resilience of the underlying network and storage fabric is the other half.
Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.