A decade in data: what millions of records taught me about people

AI

A decade of experience with large-scale data systems shows how to architect for reality: blend deterministic automation with agentic AI and treat fairness as a core feature.

The first time I saw it, I was sure it was a bug. In a massive transaction log from a logistics system, a tiny fraction of users consistently made a dozen small, sequential purchases instead of one large one. The data pattern was inefficient, cost more in fees, and looked like a classic case of data corruption. But it wasn't. It was people gaming a poorly designed rewards system, a human workaround invisible to the machine until you knew what to look for. The data was accurate, but the real story was hidden between the lines.

Working with data at scale teaches you this lesson again and again: a database is not a reflection of reality. It's a collection of shadows cast by human behavior, shaped by the systems we build. Learning to read those shadows is the real work.

The Data Doesn't Lie, but It Forgets to Tell the Truth

We often treat data as the ultimate source of truth. It's clean, numerical, and doesn't have opinions. But data is mute on the subject of "why." I once worked on an e-commerce platform where the analytics screamed about an alarmingly high cart abandonment rate at the final step. The numbers were clear: users were flaky or getting distracted. That was the story the dashboard told.

The real story was in the system's architecture. The shipping cost calculation was a heavyweight, external API call that we deferred until the very last step so it wouldn't slow down the rest of the checkout. Users filled their carts, committed to buying, and then got hit with a surprise shipping fee that broke their mental model of the total cost. They weren't indecisive; they felt tricked. The data showed the symptom, but the root cause was a design trade-off I had made months earlier.

This is a fundamental lesson for anyone building systems today. Observability isn't just about CPU usage and query times. True observability means instrumenting the user's reality. The most durable architectures capture not just the event (item_added_to_cart) but the context (shipping_cost_not_yet_displayed). Without that context, you're just measuring echoes.
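As a rough illustration of capturing context alongside the event, here is a minimal Python sketch. The event and context names come from the paragraph above; the `CartEvent` class, the `build_cart_event` helper, and the `user_id` field are hypothetical and stand in for whatever your own event schema looks like.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class CartEvent:
    """A product event that also records what the user could see at the time."""
    name: str
    user_id: str
    # Context flags describe the user's view of the system, not system health.
    context: dict = field(default_factory=dict)
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def build_cart_event(user_id: str, shipping_cost_displayed: bool) -> CartEvent:
    """Attach the shipping-visibility context to the cart event (hypothetical helper)."""
    return CartEvent(
        name="item_added_to_cart",
        user_id=user_id,
        context={"shipping_cost_not_yet_displayed": not shipping_cost_displayed},
    )


event = build_cart_event(user_id="u-123", shipping_cost_displayed=False)
# Emit one structured log line; any log pipeline that accepts JSON would do.
print(json.dumps(asdict(event), sort_keys=True))
```

With the flag stored on the event itself, an abandonment analysis can later be split by whether the user had already seen the shipping cost, instead of guessing at the cause from the funnel alone.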

Predictability Is a Privilege, Not a Law

For years, our world was one of predictable aggregates. With enough history, we could forecast demand, user growth, and server load. We built deterministic systems based on these patterns, hard-coding business rules and assuming yesterday's patterns would hold for tomorrow. That approach works, right up until the moment it doesn't.

A market shift, a new feature, a global event—suddenly, the bedrock of assumptions turns to sand. Every long-held pattern of user behavior shatters overnight. This is where the modern tension between deterministic automation and agentic systems gets interesting. The "Software 2.0" idea, which Andrej Karpathy described in a 2017 post, suggests a future where most logic is learned, not coded. In my experience, the more durable approach is a hybrid one.

[Figure: Hybrid Processing Model. Nodes: Request, Deterministic Core, Agentic Edge, Response. Path labels: Known Pattern, Novel Exception.]

The architecture should let deterministic workflows handle the vast majority of traffic that follows predictable rules. But we need resilient, adaptive components—simple LLM agents, for example—to handle the small fraction of novel exceptions. The goal isn't to replace one with the other, but to build architectures where they cooperate, letting the rigid, efficient path handle the known world while the flexible, intelligent path explores the unknown.
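The cooperation described above can be sketched as a simple router: try the deterministic rules first, and hand anything they don't recognize to an adaptive fallback. This is a minimal sketch under stated assumptions, not a production design. The rule functions, the request shape, and `agentic_fallback` (a stand-in for a real LLM agent call) are all hypothetical.

```python
from typing import Callable, Optional

# A rule returns a response, or None when the request is not a known pattern.
Rule = Callable[[dict], Optional[str]]


def refund_under_limit(request: dict) -> Optional[str]:
    """Hypothetical hard-coded business rule: auto-approve small refunds."""
    if request.get("type") == "refund" and request.get("amount", 0) <= 50:
        return "refund_approved"
    return None


def address_change(request: dict) -> Optional[str]:
    """Hypothetical hard-coded business rule: address changes are routine."""
    if request.get("type") == "address_change":
        return "address_updated"
    return None


def agentic_fallback(request: dict) -> str:
    """Stand-in for an LLM agent; a real system would call a model here."""
    return f"escalated_to_agent:{request.get('type', 'unknown')}"


def handle(
    request: dict, rules: list[Rule], fallback: Callable[[dict], str]
) -> tuple[str, str]:
    """Try the deterministic core first; route novel exceptions to the edge."""
    for rule in rules:
        response = rule(request)
        if response is not None:
            return "deterministic", response
    return "agentic", fallback(request)


RULES = [refund_under_limit, address_change]
print(handle({"type": "refund", "amount": 20}, RULES, agentic_fallback))
# ('deterministic', 'refund_approved')
print(handle({"type": "refund", "amount": 900}, RULES, agentic_fallback))
# ('agentic', 'escalated_to_agent:refund')
```

Returning the path taken alongside the response makes it easy to measure how much traffic actually reaches the agentic edge, which is the number that tells you when a recurring exception deserves its own deterministic rule.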

Bias Is the Default, Not the Edge Case

If you train a model on a decade of historical data, you have not built a predictive model. You have built a historical reenactment engine. This is the most dangerous lesson, because it’s the one we most want to ignore.

I saw this firsthand with a system designed to surface promising candidates from a resume pool. Fed with years of data on who was hired and promoted, it learned the patterns perfectly. And the main pattern it learned was that the company had historically hired a very specific demographic. The algorithm wasn't malicious; it was a perfect mirror reflecting the organization's own latent biases back at it, a textbook case of what fairness researchers, borrowing a term from employment law, call disparate impact.

This is an architecture problem. We cannot relegate "ethics" to a final review. It must be designed into the pipeline from day one. Foundational texts like the book Fairness and Machine Learning by Barocas, Hardt, and Narayanan establish the computer science principles here. We must treat fairness as a non-functional requirement, like uptime or performance.
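To make "testable" concrete, here is one way a fairness metric might be computed inside a pipeline: the ratio of the lowest group selection rate to the highest. This is a sketch, not the system described above. The group labels and records are made up for illustration, and the 0.8 cut-off is the commonly cited "four-fifths" heuristic, used here as an assumption and not as a universal standard.

```python
from collections import defaultdict


def selection_rates(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of positive outcomes per group; each record is (group, selected)."""
    totals: dict[str, int] = defaultdict(int)
    selected: dict[str, int] = defaultdict(int)
    for group, was_selected in records:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}


def disparate_impact_ratio(records: list[tuple[str, bool]]) -> float:
    """Lowest selection rate divided by the highest (1.0 means parity)."""
    rates = selection_rates(records)
    if not rates:
        raise ValueError("no records to evaluate")
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was selected, so there is no disparity to measure
    return min(rates.values()) / highest


# Illustrative data only: group A is selected 5 times in 10, group B 2 in 10.
records = (
    [("A", True)] * 5 + [("A", False)] * 5
    + [("B", True)] * 2 + [("B", False)] * 8
)
ratio = disparate_impact_ratio(records)
THRESHOLD = 0.8  # assumed heuristic; choose and justify your own
print(round(ratio, 2), "FAIL" if ratio < THRESHOLD else "PASS")
# 0.4 FAIL
```

A check like this can run in the same test suite as latency and uptime checks, which is what treating fairness as a non-functional requirement amounts to in practice.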

A Unified Architecture for Reality

Putting these lessons together—observing user context, blending deterministic and agentic work, and embedding fairness—requires a holistic view of the system. We can't bolt these features on. They must be part of the core design, from how data is ingested to how decisions are served.

[Figure: A Resilient Data and AI Architecture. Ingestion Sources: User Applications, Streaming Events, Batch Files, Third-Party APIs. Processing & Enrichment: Deterministic Pipelines, Agentic Workers, Data Validation, Feature Store. Analytics & Storage: Data Warehouse, Vector Database, Metrics Store. Serving & Action: Real-Time APIs, Dashboards, Alerting Engine, Fairness Monitor.]

This structure acknowledges that modern systems serve multiple purposes. Deterministic pipelines are still the fastest way to handle high-volume, known processes. Agentic workers provide the flexibility for new or ambiguous tasks. And a dedicated fairness monitor acts as a crucial circuit breaker, ensuring that our automated systems operate within responsible bounds.
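The circuit-breaker role of the fairness monitor could look something like the sketch below. It assumes some upstream job periodically reports a fairness ratio where 1.0 means parity. The class names, the 0.8 default, and the `score_candidate` decision function are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


class FairnessBreakerOpen(RuntimeError):
    """Raised when automated decisions are halted pending human review."""


@dataclass
class FairnessCircuitBreaker:
    min_ratio: float = 0.8  # assumed threshold; set per system and policy
    is_open: bool = False

    def record(self, ratio: float) -> None:
        """Trip the breaker when the reported fairness ratio falls too low."""
        if ratio < self.min_ratio:
            self.is_open = True

    def guard(self, decide: Callable[..., Any], *args: Any) -> Any:
        """Run an automated decision only while the breaker is closed."""
        if self.is_open:
            raise FairnessBreakerOpen("fairness ratio below threshold")
        return decide(*args)

    def reset(self) -> None:
        """Close the breaker again, for example after a human review."""
        self.is_open = False


def score_candidate(candidate: dict) -> str:
    """Hypothetical automated decision that the breaker protects."""
    return "shortlist" if candidate.get("years_experience", 0) >= 5 else "review"


breaker = FairnessCircuitBreaker(min_ratio=0.8)
print(breaker.guard(score_candidate, {"years_experience": 6}))  # shortlist

breaker.record(0.4)  # the monitor reports a ratio below the threshold
try:
    breaker.guard(score_candidate, {"years_experience": 6})
except FairnessBreakerOpen as exc:
    print(f"halted: {exc}")  # halted: fairness ratio below threshold
```

The breaker stays open until someone explicitly resets it. That is a deliberate choice in this sketch: a metric that recovers on its own should not silently re-enable automated decisions.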

Actionable Takeaways for Architects

After years of looking at schemas, tables, and terabytes of logs, the durable lessons aren't about a specific technology. They are about how to model the messy human world in a reliable way.

  • Instrument user context, not just system events. Your logs should tell you not only what happened, but what reality the user was experiencing when it happened.
  • Build hybrid systems. Use deterministic automation for the predictable bulk of the work (roughly 95%, as a rule of thumb) and agentic systems for the remaining 5% or so of novel exceptions. Don't fall for the hype that one will replace the other.
  • Treat fairness as a testable, non-functional requirement. Design fairness metrics into your data pipelines from day one, with automated circuit breakers that halt the system when bias exceeds a predefined threshold.
  • Distrust certainty. The most fragile systems are those with the most deeply embedded assumptions about user behavior. Architect for adaptation.

The best architecture is ultimately an act of respect for the user. The millions of records taught me about statistics and scale, yes. But mostly, they taught me that we're not just engineering systems; we're mediating relationships, and that is a responsibility.

Juan Cardena
Enterprise Architect, Data & AI

Enterprise architect with 25 years across web, software, data, and AI. MIT CDAO ’25. Writing on agentic AI in production.