Caching: the second hardest problem, and I learned it the hard way

The monitoring dashboard was a sea of red. A core service, responsible for assembling user-facing data, was timing out under load. Every request triggered a storm of database joins, and the system couldn't keep up. The immediate answer from the team was unanimous: "We need to cache it." It sounded so simple. It was the moment the easy part of the problem ended and the truly hard part began.

This is a familiar story. A performance bottleneck appears, and the cache is presented as a silver bullet. For a short time, it works. Latency plummets and the database breathes. But what we've really done is trade a predictable, slow system for a fast, complex, and occasionally incorrect one. We've taken on the burden of managing distributed state, often without realizing it.

The Deceptive Win of a High Hit-Rate

In that instance, we put a Redis layer in front of the service. The logic was a textbook cache-aside pattern. On a request, check Redis first. If the data is there, return it. If not, query the database, assemble the object, store it in Redis with a 15-minute time-to-live (TTL), and then return it. The results were dramatic. The P99 latency dropped from over two seconds to under 50 milliseconds. CPU load on the database fell by 80%.
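The read path described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: InMemoryCache is a hypothetical stand-in for a Redis client so the example runs on its own, and get_user_view, load_from_db, and the user_view key prefix are names invented for the sketch.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECONDS = 15 * 60  # the 15-minute TTL described in the text


class InMemoryCache:
    """Hypothetical stand-in for a Redis client: string values with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key: str, value: str, ttl: float) -> None:
        self._data[key] = (value, self._clock() + ttl)


def get_user_view(
    user_id: int,
    cache: InMemoryCache,
    load_from_db: Callable[[int], dict],
) -> dict:
    """Cache-aside read: try the cache first, fall back to the database."""
    key = f"user_view:{user_id}"  # assumed key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # hit: no database work at all
    view = load_from_db(user_id)  # miss: run the expensive joins
    cache.set(key, json.dumps(view), TTL_SECONDS)
    return view
```

Note that nothing in this read path knows when the underlying rows change. The TTL is the only thing that ever removes an entry.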

[Figure: Basic cache-aside pattern]

By all metrics, it was a success. We had solved the performance problem. The trouble was, we had created a new, more insidious data consistency problem that was waiting quietly in the wings.

The Ghost of Stale Data

The first bug report was confusing. An operator changed a customer's permissions in an admin tool, but the customer still saw the old state. Fifteen minutes later, like clockwork, the problem resolved itself. The culprit, of course, was our TTL.

The admin tool wrote directly to the database, but the main application read from the cache. The two code paths were now decoupled in a way that violated consistency. This is the first lesson caching teaches the hard way: your cache invalidation strategy is more important than your caching strategy. A simple TTL only works for data that can tolerate being stale for the full TTL window. For anything else, you need an active, explicit way to remove a key when its source of truth changes. This is a version of the classic trade-off in distributed systems theory, most famously framed by Eric Brewer's CAP theorem. Loosely speaking, we had chosen availability (a fast response from the cache) and sacrificed strong consistency.

Invalidation as a Systems Problem

Our first fix was to have the admin tool explicitly issue a DELETE command to the correct Redis key. This plugged the immediate hole, but it created tight coupling. As the system grew, more services needed to modify that underlying data. Relying on every developer of every service to remember to manually bust a cache was an architecture built on hope.
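A toy version of that first fix might look like the sketch below. Plain dictionaries stand in for both the database and Redis, and every name here (db, cache, cache_key, read_permissions, admin_update_permissions) is hypothetical. The invalidate flag exists only to make the before-and-after behavior visible.

```python
from typing import Dict

db: Dict[int, str] = {42: "viewer"}  # source of truth (stand-in for the database)
cache: Dict[str, str] = {}           # stand-in for Redis


def cache_key(user_id: int) -> str:
    return f"permissions:{user_id}"  # assumed key scheme


def read_permissions(user_id: int) -> str:
    """Main application read path: cache first, database on a miss."""
    key = cache_key(user_id)
    if key not in cache:
        cache[key] = db[user_id]
    return cache[key]


def admin_update_permissions(user_id: int, role: str, invalidate: bool = True) -> None:
    """Admin write path: update the database, then delete the cached copy."""
    db[user_id] = role
    if invalidate:
        # Delete rather than overwrite: the next read repopulates from the database.
        cache.pop(cache_key(user_id), None)
```

With invalidate=False, a warmed cache keeps returning the old role after the write, which is the bug from the report. With the delete in place, the next read sees the new value. The coupling problem is also visible: the writer has to know the reader's key scheme.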

This led us to an event-driven approach, a pattern detailed extensively in foundational books like Martin Kleppmann's Designing Data-Intensive Applications. Instead of writers directly manipulating the cache, we used a change-data-capture (CDC) stream from the database. A dedicated "cache invalidator" service subscribed to this stream and was solely responsible for translating data events into targeted cache operations. This decoupled the application services from the caching infrastructure. Any service could write to the database, and the cache would be correctly invalidated as a side effect, without the service needing to know a cache even existed.
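The core of such an invalidator can be quite small. The sketch below assumes a hypothetical change-event shape (a dict with table, op, and row fields) and a hypothetical mapping from tables to cache keys. Real CDC tools emit their own formats, and a real service would consume from a stream and talk to Redis rather than take an iterable and a dict.

```python
from typing import Any, Callable, Dict, Iterable, List, MutableMapping

# Assumed event shape: {"table": str, "op": "insert" | "update" | "delete", "row": {...}}
ChangeEvent = Dict[str, Any]

# Which cache keys depend on which table. The table names and key schemes
# here are illustrative assumptions, not taken from a real schema.
KEY_BUILDERS: Dict[str, Callable[[Dict[str, Any]], List[str]]] = {
    "user_permissions": lambda row: [
        f"permissions:{row['user_id']}",
        f"user_view:{row['user_id']}",
    ],
    "users": lambda row: [f"user_view:{row['id']}"],
}


def keys_to_invalidate(event: ChangeEvent) -> List[str]:
    """Translate one data change into the cache keys it makes stale."""
    builder = KEY_BUILDERS.get(event["table"])
    if builder is None:
        return []  # nothing cached depends on this table
    return builder(event["row"])


def run_invalidator(events: Iterable[ChangeEvent], cache: MutableMapping[str, str]) -> int:
    """Consume change events and delete affected keys; returns the number deleted."""
    deleted = 0
    for event in events:
        for key in keys_to_invalidate(event):
            if cache.pop(key, None) is not None:
                deleted += 1
    return deleted
```

The knowledge of which keys depend on which tables now lives in one place instead of being spread across every writer.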

The Thundering Herd Awaits

Solving invalidation uncovers the next layer of problems. When a popular, expensive key is invalidated, thousands of requests may arrive in the same instant. They all see a cache miss. All of them proceed to hammer the database with the exact same query, re-creating the original problem. This is the "thundering herd" problem.

The standard defense is a distributed lock. The first process to see a miss acquires a lock, regenerates the data, and writes it back. The others wait briefly for the lock to be released and retry their read. This works, but adds yet another piece of state to manage. You need to handle lock timeouts and retries, a pattern for which the Redis documentation itself provides explicit guidance and warnings. Every layer of optimization adds new failure modes.
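A rough sketch of that lock-and-retry flow is shown below. FakeRedis is a hypothetical in-memory stand-in that mimics only the get, set (with nx and px), and delete calls the sketch needs. get_with_lock and its parameters are illustrative names. With a real client, note that values may come back as bytes and that the final check-and-delete should be made atomic, for example with a server-side script.

```python
import threading
import time
import uuid
from typing import Callable, Dict, Optional


class FakeRedis:
    """Hypothetical in-memory stand-in exposing only the calls used below."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._mutex = threading.Lock()

    def get(self, key: str) -> Optional[str]:
        with self._mutex:
            return self._data.get(key)

    def set(self, key: str, value: str, nx: bool = False, px: Optional[int] = None) -> bool:
        with self._mutex:
            if nx and key in self._data:
                return False  # key already exists: lock not acquired
            self._data[key] = value  # px (expiry in ms) is ignored by this stand-in
            return True

    def delete(self, key: str) -> None:
        with self._mutex:
            self._data.pop(key, None)


def get_with_lock(
    client: FakeRedis,
    key: str,
    rebuild: Callable[[], str],
    lock_ttl_ms: int = 5000,
    wait_s: float = 0.05,
    max_attempts: int = 100,
) -> str:
    """Let one caller rebuild a missing key while the others wait and retry."""
    lock_key = f"lock:{key}"
    token = uuid.uuid4().hex  # identifies this caller as the lock owner
    for _ in range(max_attempts):
        cached = client.get(key)
        if cached is not None:
            return cached
        if client.set(lock_key, token, nx=True, px=lock_ttl_ms):
            try:
                cached = client.get(key)  # re-check: someone may have just finished
                if cached is not None:
                    return cached
                value = rebuild()  # the single expensive database query
                client.set(key, value)
                return value
            finally:
                # Release only our own lock (not atomic in this sketch).
                if client.get(lock_key) == token:
                    client.delete(lock_key)
        time.sleep(wait_s)  # another caller holds the lock: wait, then retry
    return rebuild()  # lock holder appears stuck: fall back to the database
```

The re-check after acquiring the lock matters: without it, a caller that saw a miss just before the previous holder finished would rebuild the value a second time.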

What I Build Now

Having been burned by these ghosts, my approach is grounded in a respect for the complexity involved. Phil Karlton's famous aphorism names cache invalidation and naming things as the two hardest problems in computer science for a reason. These are the principles I follow today.

[Figure: Event-driven cache invalidation architecture]

Default to no cache. I don't add one until monitoring proves a specific query is a system bottleneck. Premature caching is a particularly nasty form of premature optimization.
Isolate the caching logic. I wrap data access in a Repository or similar abstraction. The application calls getConfig(), and the implementation handles the check/get/set logic internally. The cache can then be changed or removed without rewriting the application.
Survive a cache apocalypse. The system must function correctly, albeit slowly, if the entire cache disappears. A cache is a performance enhancement, not a database. If your app can't run without it, you have two databases.
Prefer event-driven invalidation. For data written by more than one service, relying on writers to manually invalidate keys is too fragile. A CDC-based approach provides a much more durable and decoupled architecture.
Monitor correctness, not just hit-rate. A 99.9% hit rate is meaningless if you are serving stale data that causes errors. The most important metric is whether the system as a whole behaves as intended.
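Two of these principles, isolating the caching logic and surviving a cache outage, fit together in one small abstraction. The sketch below is one possible shape, not a prescribed design: ConfigRepository, get_config, the Cache protocol, and the DictCache and BrokenCache helpers are all hypothetical names chosen for illustration.

```python
import logging
from typing import Callable, Dict, Optional, Protocol

log = logging.getLogger("config_repo")


class Cache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class DictCache:
    """Trivial in-memory cache, used here only for demonstration."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class BrokenCache:
    """Simulates a cache outage: every call fails."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("cache unavailable")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("cache unavailable")


class ConfigRepository:
    """Callers use get_config(); whether a cache exists is an internal detail."""

    def __init__(self, load_from_db: Callable[[str], str], cache: Optional[Cache] = None) -> None:
        self._load_from_db = load_from_db
        self._cache = cache

    def get_config(self, name: str) -> str:
        key = f"config:{name}"
        if self._cache is not None:
            try:
                cached = self._cache.get(key)
                if cached is not None:
                    return cached
            except Exception:
                # A cache failure must not become an application failure.
                log.warning("cache read failed; falling back to database", exc_info=True)
        value = self._load_from_db(name)
        if self._cache is not None:
            try:
                self._cache.set(key, value)
            except Exception:
                log.warning("cache write failed; continuing without cache", exc_info=True)
        return value
```

Constructing the repository with cache=None, with a DictCache, or with a BrokenCache returns the same values to the caller. Only the number of database loads changes, which is the property the "cache apocalypse" principle asks for.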

Caching is not a technology choice; it's an architectural commitment. It’s a promise to manage distributed state correctly, and that’s a promise that should never be made lightly.