Real-time vs nightly: choosing latency honestly

The meeting is going well. We’ve aligned on the goals for the new analytics platform. Then someone, usually with the best of intentions, says the magic words: “And of course, this all needs to be real-time.”

A silence hangs in the air. Every engineer in the room does the same mental calculation. The project instantly forks into two futures: one where we build a durable system, and another where we chase a fashionable requirement that adds a stack of new, always-on distributed systems that must be managed and debugged.

The Siren Call of Instant Data

I understand the appeal. Demos with dashboards that flicker and update with millisecond latency are impressive. They create a powerful illusion of control. This streaming-first view has a strong intellectual foundation, articulated perfectly by Jay Kreps in his essay “The Log: What every software engineer should know about real-time data's unifying abstraction.” For a tiny subset of problems (fraud detection, ad bidding, inventory control), this immediacy is the entire business function.

But for most enterprise use cases, it's a trap. The desire for "real-time" is often a poorly articulated proxy for "I want data that is trustworthy and accessible when I need it." A nightly batch job that runs reliably at 2 AM and populates a report for a 9 AM meeting is a perfect alignment of latency, cost, and business process. The decisions being made happen on a human timescale, not a machine one.

[Figure: The Latency and Complexity Spectrum]

Pushing every system toward the low-latency, high-complexity end of that spectrum by default is an architectural mistake. It’s like insisting on a Formula 1 car for your daily grocery run. It’s expensive, fragile, and entirely the wrong tool for the job.

The Real Cost: The Complexity Tax

When you sign up for a real-time system, you are buying into a fundamentally different paradigm. This introduces a "complexity tax" you pay at every stage.

Architectural Shift: Simple, deterministic batch jobs on a scheduler are replaced with an always-on ecosystem. You now need an event bus like Kafka, a stream processor like Flink, and ways to handle state, out-of-order events, and reprocessing. Each component is a new potential point of failure.
Operational Burden: Debugging a failed batch job is usually straightforward: you read the logs, fix the cause, and rerun the job. Debugging a streaming pipeline at 3 AM is a different beast entirely, a challenge detailed in foundational texts like Martin Kleppmann's Designing Data-Intensive Applications. Is the problem in the source, the bus, the consumer, or the sink? Is state corrupted? The operational overhead is substantially higher.
Cost Explosion: A batch process runs on ephemeral compute that spins up and down. A streaming architecture requires clusters that are on 24/7, consuming resources constantly. The cloud bill for a poorly justified streaming system can be staggering.
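To make the cost point concrete, here is a rough back-of-the-envelope sketch. Every number in it (node counts, run length, the price per node-hour) is a hypothetical assumption chosen for illustration, not a figure from any real bill, and the class names are invented for this example.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average month: 8760 hours / 12


@dataclass
class BatchJob:
    """Ephemeral compute: you pay only while the job is running."""
    nodes: int
    hours_per_run: float
    runs_per_month: int
    hourly_rate: float  # hypothetical price per node-hour

    def monthly_cost(self) -> float:
        return self.nodes * self.hours_per_run * self.runs_per_month * self.hourly_rate


@dataclass
class StreamingCluster:
    """Always-on compute: you pay for every hour of the month."""
    nodes: int
    hourly_rate: float  # hypothetical price per node-hour

    def monthly_cost(self) -> float:
        return self.nodes * HOURS_PER_MONTH * self.hourly_rate


# Assumed workload: a nightly one-hour job on 10 nodes versus a 6-node cluster running 24/7.
nightly = BatchJob(nodes=10, hours_per_run=1.0, runs_per_month=30, hourly_rate=0.5)
streaming = StreamingCluster(nodes=6, hourly_rate=0.5)

print(f"batch:     {nightly.monthly_cost():.2f} per month")
print(f"streaming: {streaming.monthly_cost():.2f} per month")
print(f"ratio:     {streaming.monthly_cost() / nightly.monthly_cost():.1f}x")
```

Under these assumed numbers the always-on cluster costs roughly fifteen times as much, even though it uses fewer nodes. The model covers compute only; it leaves out the engineering and on-call time that the other two items describe.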

I've seen teams spend months wrestling with exactly-once semantics, only to feed a dashboard that a single vice president looks at once a week. The ROI on that engineering effort is deeply negative.

The One Question That Matters

To cut through the hype, I anchor the conversation in business reality with a single question:

"What specific business decision will be made differently with data that is five seconds old versus five hours old?"

This forces stakeholders to move from a vague desire for "freshness" to a concrete articulation of value. The answers are illuminating.

Executive Sales Dashboard: A CEO sees quarterly revenue. Will they change corporate strategy at 2:15 PM because of a sale at 2:14 PM? No. The decision cadence is daily or weekly. Verdict: Batch is perfect.
E-commerce Inventory: Three items are left. A sale occurs. Do we update the website immediately to prevent overselling? Yes. The decision has direct revenue impact. Verdict: A clear case for real-time.
Marketing Campaign Pacing: An ad manager monitors spend. Waiting 24 hours to spot overspending is costly, but sub-second data isn't needed. Verdict: A good fit for mini-batch, not full streaming.
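The three verdicts above can be sketched as a small rule of thumb. The function name, the thresholds, and the staleness value assigned to each use case are illustrative assumptions; in practice those numbers come out of the conversation with stakeholders.

```python
from datetime import timedelta


def recommend_pipeline(tolerable_staleness: timedelta) -> str:
    """Map how stale data may be before a decision suffers to a pipeline style.

    The thresholds are hypothetical cut-offs chosen for illustration.
    """
    if tolerable_staleness < timedelta(minutes=1):
        return "streaming"
    if tolerable_staleness < timedelta(hours=1):
        return "mini-batch"
    return "batch"


# Assumed answers to the "decision latency" question for each use case.
use_cases = {
    "executive sales dashboard": timedelta(days=1),
    "e-commerce inventory": timedelta(seconds=5),
    "marketing campaign pacing": timedelta(minutes=30),
}

recommendations = {
    name: recommend_pipeline(staleness) for name, staleness in use_cases.items()
}

for name, verdict in recommendations.items():
    print(f"{name}: {verdict}")
```

The value of the exercise is not the function itself but the input it demands: someone has to commit to a number for how stale the data can be.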

This question transforms the debate from a technical one about tooling to a business one about value. It provides the leverage to propose a solution proportionate to the actual need.

Pragmatic Architecture: Start Simple

My philosophy is to earn your way into complexity. For the vast majority of data systems I've built or reviewed, the right approach is to build the simplest, most robust thing first. That means starting with a daily or hourly batch process. This architecture is resilient, cost-effective, and well-understood. It is the bedrock of durable platforms.
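Part of what makes a batch job resilient is that it can be written to recompute one partition from scratch and overwrite it, so a failed or doubtful run is simply run again. The following is a minimal in-memory sketch of that idea; the record layout, the function name, and the dictionary standing in for a report table are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

Sale = Tuple[date, str, float]  # (sale date, region, amount)


def run_daily_batch(
    sales: Iterable[Sale],
    day: date,
    report: Dict[date, Dict[str, float]],
) -> None:
    """Recompute revenue per region for one day and overwrite that day's partition."""
    totals: Dict[str, float] = defaultdict(float)
    for sale_day, region, amount in sales:
        if sale_day == day:
            totals[region] += amount
    # Overwrite rather than append, so rerunning the job never double counts.
    report[day] = dict(totals)


sales = [
    (date(2024, 1, 1), "emea", 100.0),
    (date(2024, 1, 1), "emea", 50.0),
    (date(2024, 1, 1), "apac", 25.0),
    (date(2024, 1, 2), "emea", 10.0),
]

report: Dict[date, Dict[str, float]] = {}
run_daily_batch(sales, date(2024, 1, 1), report)
run_daily_batch(sales, date(2024, 1, 1), report)  # a rerun leaves the result unchanged
print(report[date(2024, 1, 1)])
```

A real job would read from a warehouse or object store rather than a list, but the property that matters is the same: the output for a given day depends only on that day's input.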

Now, the strong counterargument is that modern tools like Flink SQL or Materialize have significantly lowered the barrier to entry for streaming. It's true that writing the logic has become easier. But the operational reality of managing an always-on, distributed stateful system remains fundamentally more complex than managing a discrete, scheduled job. The failure modes are more subtle, and the cognitive load on the team is higher.

If the "decision latency" question reveals a genuine need for faster data, you evolve. You can introduce a parallel, faster path for a critical slice of data, a pattern first articulated by Nathan Marz as the Lambda Architecture, in which a speed layer covers only the gap since the last batch run. But you do this as a deliberate extension, not as your starting point for everything.
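As a toy illustration of that hybrid pattern, the sketch below keeps a batch view as the source of truth and adds a small real-time delta for events newer than the last batch run. The class and method names are hypothetical, and everything lives in memory; a real system would put a warehouse behind the batch view and a stream consumer behind the fast path.

```python
from datetime import datetime
from typing import Dict, List, Tuple


class HybridCounter:
    """Toy serving layer: batch totals plus a delta of events since the batch cutoff."""

    def __init__(self) -> None:
        self.batch_view: Dict[str, int] = {}
        self.batch_cutoff: datetime = datetime.min
        self.recent_events: List[Tuple[datetime, str, int]] = []

    def record_event(self, ts: datetime, key: str, amount: int) -> None:
        """Fast path: remember each event as it arrives."""
        self.recent_events.append((ts, key, amount))

    def load_batch(self, totals: Dict[str, int], cutoff: datetime) -> None:
        """Slow path: swap in totals that cover every event before `cutoff`."""
        self.batch_view = dict(totals)
        self.batch_cutoff = cutoff
        # Events the batch already covers are dropped from the fast path.
        self.recent_events = [e for e in self.recent_events if e[0] >= cutoff]

    def query(self, key: str) -> int:
        """Serve the batch total plus whatever has happened since the cutoff."""
        delta = sum(
            amount
            for ts, k, amount in self.recent_events
            if k == key and ts >= self.batch_cutoff
        )
        return self.batch_view.get(key, 0) + delta


units_sold = HybridCounter()
units_sold.record_event(datetime(2024, 1, 1, 23, 0), "sku-1", 2)
units_sold.record_event(datetime(2024, 1, 2, 9, 0), "sku-1", 1)

# The nightly job has counted everything before midnight on 2 January.
units_sold.load_batch({"sku-1": 2}, cutoff=datetime(2024, 1, 2, 0, 0))
units_sold.record_event(datetime(2024, 1, 2, 10, 0), "sku-1", 4)

print(units_sold.query("sku-1"))
```

Only the slice that needs immediacy, here units sold per SKU, goes through the fast path. Everything else can keep reading the batch view alone.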

[Figure: Hybrid Latency Data Architecture]

Key Takeaways

Default to Batch: Treat nightly or hourly batch processing as the standard. It is robust, cheap, and sufficient for most business needs.
Ask the "Decision Latency" Question: Relentlessly ask what decision changes with fresher data. If the answer is not concrete and value-driven, real-time is a want, not a need.
Model the Total Cost: The cost of real-time is not just the cloud bill. It's the architectural complexity, operational burden, and cognitive load on your team.
Earn Your Complexity: Start with the simplest architecture that works. Evolve toward lower latency only when a clear business case justifies the added cost and fragility.