Event-driven pipelines before they were fashionable
Exploring the foundational principles of event-driven architecture, born from necessity, and why this durable pattern of decoupling is essential for modern data and AI systems.
The system kept breaking around 2 AM. A direct call from our web service to the inventory API would time out, and suddenly a customer's payment was captured with no order sent to the warehouse. The whole chain was brittle, a house of cards where one slow database connection could bring everything down.
This was over a decade ago. We didn't call our solution "event-driven architecture." We called it "making the alerts stop." We just needed to decouple our processes so that a failure in one place didn't cause a cascading disaster.
The Tyranny of Tight Coupling
The default pattern for building services was direct and synchronous. A user checks out, so the web server calls the payment service, which calls the inventory service, which calls shipping. It’s logical and linear on a whiteboard. In production, it’s a liability. This tight coupling creates a rigid chain. If the shipping service is slow, the entire checkout process hangs. If inventory is down for a patch, no one can buy anything.
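To make the failure mode concrete, here is a minimal Python sketch of that synchronous chain. The service functions (`capture_payment`, `reserve_inventory`, `create_shipment`) and the simulated inventory outage are hypothetical stand-ins, not the original system. Because each service calls the next, an outage in inventory fails the checkout after the payment has already been captured, which is the 2 AM inconsistency from the opening.

```python
class ServiceUnavailable(Exception):
    """Raised by a downstream service that is down or timing out."""

# Hypothetical in-memory state standing in for real services.
captured_payments: list[str] = []
shipments: list[str] = []
inventory_up = False  # simulate inventory being down for a patch

def create_shipment(order_id: str) -> None:
    shipments.append(order_id)

def reserve_inventory(order_id: str) -> None:
    if not inventory_up:
        raise ServiceUnavailable("inventory timed out")
    create_shipment(order_id)            # inventory calls shipping

def capture_payment(order_id: str) -> None:
    captured_payments.append(order_id)   # money moves first...
    reserve_inventory(order_id)          # ...then the chain continues

def checkout(order_id: str) -> str:
    # The web request blocks until every link in the chain returns.
    capture_payment(order_id)
    return "confirmed"

try:
    checkout("order-42")
except ServiceUnavailable as err:
    print(f"checkout failed: {err}")

# Payment was captured, but nothing reached the warehouse.
print(captured_payments, shipments)  # ['order-42'] []
```

Every timeout handler or retry added to this code only treats the symptom; the coupling itself is what lets one service's outage leave the others in an inconsistent state.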
We spent more time writing defensive code—timeout handlers, complex retry logic, reconciliation scripts—than we did building features. The architecture itself was the source of our operational pain. It wasn't scalable, and it certainly wasn't resilient.
A Fact Goes on the Bus
Our breakthrough was to stop telling other services what to do. Instead, each service would simply announce facts about what had just happened in its domain. The web front-end wouldn't command the warehouse to decrement inventory; it would just broadcast an immutable event: OrderPlaced, containing the order ID and items.
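An event like this can be modeled as an immutable value. The sketch below uses frozen dataclasses; the `LineItem` type and the `event_id` and `occurred_at` fields are our assumptions rather than details of the original system, though a unique ID is what later lets consumers recognize duplicates. The name is past tense on purpose: it records something that happened and carries no instruction for consumers.

```python
import uuid
from dataclasses import FrozenInstanceError, dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineItem:
    sku: str
    quantity: int

@dataclass(frozen=True)
class OrderPlaced:
    """A fact: an order was placed. Consumers decide what to do about it."""
    order_id: str
    items: tuple[LineItem, ...]  # a tuple, not a list, so the payload cannot be mutated
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # assumed dedup key
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

event = OrderPlaced(order_id="order-42", items=(LineItem("SKU-1", 2),))

try:
    event.order_id = "order-43"  # facts cannot be rewritten after publication
except FrozenInstanceError:
    print("events are immutable")
```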
We had stumbled into what Gregor Hohpe and Bobby Woolf codified in their 2003 book Enterprise Integration Patterns as a Publish-Subscribe Channel. We put a message queue in the middle of everything. It was a simple, durable, and frankly boring piece of technology that acted as a temporary, reliable buffer. The web service's only job was to publish its event. Once the message was safely in the queue, the transaction, from the user's perspective, was done.
Downstream, other services subscribed to the events they cared about. The warehouse service listened for OrderPlaced and processed orders at its own pace. If it went down for an hour, it was no longer a catastrophe. The events would simply wait in the queue. When the service came back, it picked up right where it left off.
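A minimal in-memory sketch of a Publish-Subscribe Channel, assuming one FIFO buffer per subscriber. `PubSubChannel`, its methods, and the second `billing` subscriber are hypothetical; a real deployment would use a durable broker so the buffers survive restarts. It shows the two properties above: publishing returns immediately, and events for a subscriber that is down simply accumulate until it drains them.

```python
from collections import defaultdict, deque
from typing import Callable

class PubSubChannel:
    """In-memory publish-subscribe channel: each subscriber gets its own copy of every event."""

    def __init__(self) -> None:
        self._queues: dict[str, deque] = {}                     # subscriber -> pending events
        self._topics: dict[str, list[str]] = defaultdict(list)  # event type -> subscribers

    def subscribe(self, subscriber: str, event_type: str) -> None:
        self._queues.setdefault(subscriber, deque())
        self._topics[event_type].append(subscriber)

    def publish(self, event_type: str, payload: dict) -> None:
        # Fan out and return at once; the publisher never waits on a consumer.
        for subscriber in self._topics[event_type]:
            self._queues[subscriber].append((event_type, payload))

    def lag(self, subscriber: str) -> int:
        # Events delivered to this subscriber's buffer but not yet acknowledged.
        return len(self._queues[subscriber])

    def drain(self, subscriber: str, handler: Callable[[str, dict], None]) -> None:
        queue = self._queues[subscriber]
        while queue:
            event_type, payload = queue[0]
            handler(event_type, payload)
            queue.popleft()  # acknowledge only after the handler succeeds

channel = PubSubChannel()
channel.subscribe("warehouse", "OrderPlaced")
channel.subscribe("billing", "OrderPlaced")  # hypothetical second subscriber

# The warehouse is down: orders keep arriving and wait in its buffer.
for n in range(3):
    channel.publish("OrderPlaced", {"order_id": f"order-{n}"})
print(channel.lag("warehouse"))  # 3

# The warehouse comes back and picks up where it left off.
shipped: list[str] = []
channel.drain("warehouse", lambda _type, payload: shipped.append(payload["order_id"]))
print(shipped, channel.lag("warehouse"), channel.lag("billing"))
# ['order-0', 'order-1', 'order-2'] 0 3
```

Removing an event only after its handler returns means a crash mid-processing leaves the event in place to be retried, so a consumer may see it again. The `lag` count is the per-consumer queue depth worth alerting on.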
The Price of Decoupling
This approach isn't a silver bullet. Decoupling buys you resilience, but it costs you immediacy. The biggest hurdle, both technically and for the business, was embracing eventual consistency. The inventory count on the site might be a few seconds out of date. We were no longer operating in a world of system-wide ACID transactions.
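Eventual consistency is easiest to see with a read model that is updated from events rather than in the same transaction as the write. In this sketch, which is illustrative and not the original system, the hypothetical `place_order` only records a fact, and a separate projector later applies it to the stock count the site displays. Between those two moments, a reader sees the old number.

```python
from collections import deque

# Hypothetical read model: the stock count the website displays.
displayed_stock = {"SKU-1": 10}
# Events published by the order service but not yet applied by the projector.
pending_events: deque = deque()

def place_order(sku: str, quantity: int) -> None:
    # The write path records the fact; it does not touch the read model.
    pending_events.append({"type": "OrderPlaced", "sku": sku, "quantity": quantity})

def run_projector() -> None:
    # In a real system this runs asynchronously; here it is called by hand.
    while pending_events:
        event = pending_events.popleft()
        displayed_stock[event["sku"]] -= event["quantity"]

place_order("SKU-1", 3)
print(displayed_stock["SKU-1"])  # 10: stale until the projector catches up
run_projector()
print(displayed_stock["SKU-1"])  # 7: consistent again
```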
It also introduced a new class of failure modes. What if a malformed message, a "poison pill," repeatedly fails processing and blocks a queue? You need dead-letter queues and alerting. And how do you avoid doing the same work twice? Queues typically guarantee at-least-once delivery rather than exactly-once, so after a retry or a consumer crash the same message can arrive again. You must build idempotency into your consumers from day one, so that processing the same OrderPlaced event twice doesn't ship two products. Your monitoring shifts from RPC latency to queue depth, consumer lag, and processing error rates.
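Both defenses live in the consumer. This hedged sketch assumes at-least-once delivery, a unique `event_id` on every message, and an in-memory `deque` standing in for the broker; `OrderConsumer`, `consume`, and `MAX_ATTEMPTS` are hypothetical names. Duplicates are skipped by their `event_id`, and a message that keeps failing is parked in a dead-letter list after a fixed number of attempts instead of blocking everything behind it.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumption: retry budget before a message is dead-lettered

class OrderConsumer:
    """Handles OrderPlaced messages that may be delivered more than once."""

    def __init__(self) -> None:
        self.processed_ids: set[str] = set()  # idempotency keys already handled
        self.shipments: list[str] = []
        self.dead_letters: list[dict] = []

    def ship(self, message: dict) -> None:
        if "order_id" not in message:
            raise ValueError("malformed message: no order_id")
        self.shipments.append(message["order_id"])

    def handle(self, message: dict) -> None:
        event_id = message.get("event_id")
        if event_id in self.processed_ids:
            return  # duplicate delivery: already shipped, do nothing
        self.ship(message)
        # In production, store this key in the same transaction as the side effect.
        self.processed_ids.add(event_id)

def consume(queue: deque, consumer: OrderConsumer) -> None:
    while queue:
        message = queue.popleft()
        try:
            consumer.handle(message)
        except ValueError:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= MAX_ATTEMPTS:
                consumer.dead_letters.append(message)  # park the poison pill and alert on it
            else:
                queue.append(message)  # retry later instead of blocking the head of the queue

queue = deque([
    {"event_id": "e-1", "order_id": "order-42"},
    {"event_id": "e-1", "order_id": "order-42"},  # redelivered duplicate
    {"event_id": "e-bad"},                         # poison pill: no order_id
])
consumer = OrderConsumer()
consume(queue, consumer)
print(consumer.shipments)                              # ['order-42']
print([m["event_id"] for m in consumer.dead_letters])  # ['e-bad']
```

Re-queueing a failed message at the tail keeps healthy messages flowing but gives up strict ordering; if a consumer needs per-order ordering, it would have to retry in place or partition by key instead.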
Why This Old Pattern Runs Modern AI
The most powerful benefit of this pattern was one we didn't fully appreciate at the time: extensibility. When a new team needed to build a forecasting model, they didn't touch the core application. They just built a new service that subscribed to the existing OrderPlaced event stream. This is the exact reason these patterns are now foundational for building systems that mix AI with deterministic software.
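The extensibility argument can be sketched with an append-only log in which each consumer keeps its own read position. `EventLog` and the `forecasting` consumer are hypothetical. Replaying history assumes the log retains past events; a classic queue that deletes messages on acknowledgement would only give a new service the events published after it subscribed.

```python
class EventLog:
    """Append-only log; each consumer tracks its own read offset."""

    def __init__(self) -> None:
        self._entries: list[dict] = []
        self._offsets: dict[str, int] = {}

    def append(self, event: dict) -> None:
        self._entries.append(event)

    def poll(self, consumer: str, from_beginning: bool = True) -> list[dict]:
        # A new consumer starts at offset 0 (full history) or at the tail (new events only).
        if consumer not in self._offsets:
            self._offsets[consumer] = 0 if from_beginning else len(self._entries)
        start = self._offsets[consumer]
        batch = self._entries[start:]
        self._offsets[consumer] = len(self._entries)
        return batch

log = EventLog()
# The core application only appends facts; it knows nothing about its readers.
for order_id, qty in [("order-1", 2), ("order-2", 5)]:
    log.append({"type": "OrderPlaced", "order_id": order_id, "quantity": qty})

warehouse_seen = log.poll("warehouse")

# Later, the forecasting team attaches a new consumer with no change to the publisher.
demand = sum(e["quantity"] for e in log.poll("forecasting") if e["type"] == "OrderPlaced")
print(len(warehouse_seen), demand)  # 2 7
```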
An LLM agent performing a complex task—like summarizing a document, researching a topic, and then calling an external API—is the definition of a slow, non-deterministic, and potentially fallible process. Making a synchronous call to an agent and blocking a user request is architectural malpractice. It’s the 2 AM failure all over again.
Instead, we can use events to orchestrate the work. An AnalysisRequested event is published to a log. A deterministic pipeline might pick it up for validation and then publish its own event, which an agent picks up to do the heavy lifting. The agent, in turn, emits events like ResearchComplete or SummaryGenerated as it makes progress. This turns the event log into what Jay Kreps famously described as the unifying abstraction for real-time data. It becomes the central nervous system for the entire application, allowing slow agentic work and fast deterministic automation to coexist and cooperate without being tightly coupled.
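This flow can be sketched as stages that share one log and communicate only through events. `AnalysisRequested`, `ResearchComplete`, and `SummaryGenerated` come from the text; `AnalysisValidated`, `AnalysisRejected`, the `validator` and `agent` functions, and the single-process polling loop are our assumptions. The "agent" is a deterministic stub standing in for a real LLM call, so the sketch shows the wiring, not the intelligence.

```python
from typing import Callable

class EventLog:
    """Append-only log shared by all stages; each stage keeps its own offset."""

    def __init__(self) -> None:
        self.entries: list[dict] = []
        self.offsets: dict[str, int] = {}

    def append(self, event: dict) -> None:
        self.entries.append(event)

    def poll(self, consumer: str) -> list[dict]:
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.entries)
        return self.entries[start:]

def validator(event: dict, log: EventLog) -> None:
    # Fast, deterministic stage: check the request, then announce the outcome.
    if event["type"] != "AnalysisRequested":
        return
    if event.get("document"):
        log.append({"type": "AnalysisValidated", "request_id": event["request_id"],
                    "document": event["document"]})
    else:
        log.append({"type": "AnalysisRejected", "request_id": event["request_id"]})

def agent(event: dict, log: EventLog) -> None:
    # Stub for a slow, non-deterministic LLM agent; it reports progress as events.
    if event["type"] != "AnalysisValidated":
        return
    notes = f"notes on {len(event['document'].split())} words"  # placeholder research
    log.append({"type": "ResearchComplete", "request_id": event["request_id"], "notes": notes})
    log.append({"type": "SummaryGenerated", "request_id": event["request_id"],
                "summary": event["document"][:20]})  # placeholder summary

STAGES: dict[str, Callable[[dict, EventLog], None]] = {"validator": validator, "agent": agent}

def run_until_idle(log: EventLog) -> None:
    # Each stage polls independently; in production these would be separate processes.
    progressed = True
    while progressed:
        progressed = False
        for name, stage in STAGES.items():
            for event in log.poll(name):
                stage(event, log)
                progressed = True

log = EventLog()
log.append({"type": "AnalysisRequested", "request_id": "r-1",
            "document": "Quarterly sales grew in every region"})
log.append({"type": "AnalysisRequested", "request_id": "r-2", "document": ""})
run_until_idle(log)
for e in log.entries:
    print(e["request_id"], e["type"])
```

Because the validator and the agent never call each other, a slow agent only delays its own events; the validator keeps accepting requests, and either stage can be replaced or scaled without touching the other.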
What to Remember
The lesson here is not to chase a trend, but to understand the trade-offs. The asynchronous, event-driven patterns we adopted out of necessity are more relevant than ever. They provide the resilience and flexibility required to build sophisticated systems where different components operate on vastly different timescales and levels of reliability.
- Broadcast facts, don't give orders. Let consumers decide how to react to events. This creates extensible, loosely coupled systems.
- Embrace eventual consistency. The price for resilience is often immediacy. For many workflows, especially those involving AI, this is a necessary and worthwhile trade.
- Use the event log as your system's backbone. It's the ideal mechanism for orchestrating work between fast, deterministic code and slow, non-deterministic AI agents.
Before building a rigid chain of direct calls, ask yourself: does this need to be immediate? Or does it need to be durable? Often, the most robust systems are the ones that choose durability.