The Data Swamp Never Drained. We Just Taught It to Talk.
I watched the data lake rot into a swamp for fifteen years. AI is repeating it one abstraction up, and this is the first swamp that feeds itself.
For fifteen years I watched the data lake rot into a data swamp: dump everything raw, govern it later, until no one could find a table or trust a number. We are now doing the exact same thing with AI, one rung up the ladder. Throw it at the model, spin up a RAG, add an agent, govern it later. The swamp never drained. We just taught it to talk. And this one is the first swamp that feeds itself.
I have watched this movie five times. Different decade, different artifact, same plot: something that used to be expensive becomes nearly free, everyone makes too much of it, and a swamp forms in the gap between how fast we can create and how slowly we can verify. The data swamp was the fourth time. The AI swamp is the fifth. I want to tell you why it is the same story, why this sequel is more dangerous than the others, and the one thing that has saved a system every single time.
The 3 a.m. table that didn't exist
Picture a night sometime in the 2010s. A dashboard the CEO trusts has a number that is suddenly, obviously wrong. I am tracing it backward through the "data lake" we were so proud of, the one the vendor promised would end all our integration pain. Store everything raw, schema-on-read, value will emerge.
Three hours in, the trail dies in a table named fct_orders_final_v2_use_this. No owner. No lineage. No one in the company can tell me where it comes from, who built it, or whether the join that produced it was ever correct. It is not that the data is missing. It is that the truth about the data is missing. The lake had become a swamp, and a swamp is not a place with too much water. It is a place with no flow: things go in and nothing moves, nothing is traced, nothing is trusted. The water was never the problem. The stillness was.
I think about that table constantly now, because I am watching the same thing happen to AI, and almost no one is calling it by its name.
The law that keeps repeating
Step back far enough and the whole history of computing rhymes. Each era takes one signature artifact and drives its cost of creation toward zero. Each time, we celebrate. And each time, a swamp forms, because the cost of making the artifact collapsed but the cost of verifying it did not. The gap between those two curves is where every swamp lives.
Here is the part that matters: each era's artifact becomes the substrate for the next era's creation. Instructions build pages. Pages become software. Software emits data. Data trains inference. So the swamps don't just repeat, they stack, each one inheriting the silt of the one below it.
| Era | The artifact we made free | The swamp that formed | What "later" never came |
|---|---|---|---|
| Compute (pre-web) | The instruction / program | Spaghetti code, the 1968 "software crisis" — systems larger than we could hold in our heads | Maintainability |
| Web | The page / link | The unsearchable spam web | Trust — until PageRank cut a ridge through it (links as lineage) |
| Software | The module / package | Dependency hell, supply-chain sprawl — composability outran comprehension | Provenance of what you depend on |
| Data | The record / table | The canonical data swamp — storage got cheap faster than meaning | Lineage, ownership, quality |
| AI | The inference (token, answer, agent) | The AI swamp — the first that generates its own pollution | The ability to tell real from generated |
Machine cycles got cheap, so we wrote more software than we could hold in our heads. The 1968 "software crisis" was the first swamp: systems larger than any one mind could map.
Publishing went to zero and the signal drowned in spam. PageRank cut a ridge through the swamp by treating links as lineage — it never drained it.
Reuse got free, so we composed faster than we could comprehend. Dependency hell and supply-chain sprawl: the world breaks because one invisible leg gets pulled.
Storage got cheap faster than meaning. The canonical data swamp: a 3 a.m. number nobody can trace to an owner, a source, or a reason to believe it.
Generation went to zero — and this is the first swamp that feeds itself: its output becomes the next system's input, until the human signal dilutes toward collapse.
Five eras, one disease, and the fifth feeds itself.
Look down the last column. It is the same column every time, wearing a different costume. The swamp is never really about volume. It is about a debt we agreed to pay "later," and "later" is a country that does not exist.
The data swamp and the AI swamp are the same machine
If you lived through the data swamp, the AI swamp will feel like déjà vu, because it is. Map them row by row and the disguise falls off.
| Data Lake → Data Swamp (2010s) | AI Platform → AI Swamp (2020s) |
|---|---|
| "Store everything raw, schema-on-read." | "Throw it at the LLM, spin up a RAG, add an agent." |
| Marginal cost of a byte → 0 | Marginal cost of an answer → 0 |
| No catalog → nobody can find the table | No registry → nobody can find the model, prompt, or index |
| No lineage → nobody can trust the number | No provenance → nobody can trust the generated answer |
| No owner → orphaned pipelines rot quietly | No owner → orphaned agents run, unwatched, in production |
| No quality gate → garbage in, garbage stored | No eval gate → garbage in, garbage generated and reused |
| Cost deferred to retrieval (it's all unusable) | Cost deferred to trust (you can't tell what's real) |
| Pollution is inert — it sits and rots | Pollution is generative — it breeds |
Seven of those eight rows are a straight port. The eighth is where the sequel earns its horror.
The swamp that feeds itself
A data swamp is a passive disaster. It just sits there, rotting, waiting for someone to wade in. You could ignore it for years and it would only get staler, never worse in kind.
The AI swamp is alive. Its pollution is generative: a model produces synthetic text, plausible-but-wrong facts, AI-written code and docs, and all of it flows downstream to become the input of the next system, the next index, the next training run. The snake has found its tail. Train on enough of your own exhaust and the distribution's rare cases die first; then what is left narrows until it barely resembles the original. The researchers call it model collapse; the ancients would have called it an ouroboros. Either way it is the first swamp in the history of computing that does not wait for you to pollute it. It pollutes itself, at machine speed, while you sleep.
Strip the theory out. The old data swamp is the shared drive we all know: folders of final_v2_FINAL.xlsx and reports no one owns. It is a mess and it never gets cleaned — but it just sits there, and the real numbers are still in the files if you are willing to dig.
The AI swamp is the report that gets written from last quarter's report instead of from the source system. Then the next one is written from that one. Each pass smooths over a little more of reality, and because nothing ever throws an error, no one notices until the number no longer matches anything real. It is a circular reference that never warns you.
The old swamp just sat there. The new one feeds on itself until nothing real is left.
See it for yourself, no faith required
You should not take "the swamp feeds itself" on my word. Here are three ways to check it tonight, from kitchen-simple to documented, no code, no lab.
1. The photocopy you've already seen. Copy a photocopy. Then copy that copy. Then copy that. By the tenth pass the page is gray mush. Nobody added noise on purpose; each copy just lost a little, and the loss compounded because every generation's input was the previous generation's flawed output. That is model collapse, and you understood it before you read a single paper. An AI trained on AI output is a photocopier pointed at its own page.
2. A real story you can verify: citogenesis. Long before AI, the web already had a machine for counterfeiting provenance. Someone adds an unsourced "fact" to Wikipedia. A rushed reporter repeats it. Then an editor "improves" the page by citing that article as the source, and the fabrication is laundered into a footnoted truth whose footnote points back at the lie. It has a name (citogenesis), it is documented, and it has happened repeatedly. Read one real case. That loop, where plausibility manufactures its own evidence, is the hidden truth, just running by hand instead of at machine speed.
3. The slop already in the wild. Search a shopping site or your search engine for the phrase a chatbot emits when it refuses, "As an AI language model." You will find real product listings, book titles, and reviews where the machine's boilerplate leaked straight into published, "human" content, with nobody checking in between. That is the swamp's water already in the tap. You are not waiting for the AI swamp. You are standing in the shallow end.
And for the rigorous version, the Shumailov paper linked at the end of this piece runs exactly this loop under controlled conditions and plots the collapse: the tails of the distribution die first, then what remains narrows into something that no longer resembles the original. The photocopy, by measurement.
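For readers who do want to run something, the same loop can be sketched in a few lines of Python. This is a toy, not the paper's experiment. The "model" here is assumed to be nothing more than the empirical frequency table of its training data, and the function name `resample_generations`, the long-tailed vocabulary, and the sample sizes are all illustrative choices, not anything from the paper. Each generation is trained only on the previous generation's output.

```python
import random


def resample_generations(population, generations, sample_size, seed=0):
    """Repeatedly 'train' on a corpus and 'generate' the next one from it.

    Toy assumption: the model is just the empirical distribution of its
    training data, so generating means sampling with replacement from
    the current corpus.
    """
    rng = random.Random(seed)
    current = list(population)
    distinct_counts = [len(set(current))]  # generation 0: the human corpus
    for _ in range(generations):
        # The next corpus is drawn entirely from the previous one.
        current = rng.choices(current, k=sample_size)
        distinct_counts.append(len(set(current)))
    return current, distinct_counts


# Hypothetical long-tailed "human" corpus: token 1 is common, token 50 is rare.
human_corpus = [
    token for token in range(1, 51) for _ in range(max(1, 200 // token))
]

final_corpus, distinct_counts = resample_generations(
    human_corpus, generations=30, sample_size=300
)
print("distinct tokens per generation:", distinct_counts)
print("highest-numbered (rarest) surviving token:", max(final_corpus))
```

A token that misses a single generation can never come back, because nothing upstream still holds it. The count of distinct tokens can therefore only stay flat or fall, and the rare tokens are the ones most likely to be missed.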
What to call it
Names matter, because you cannot govern what you cannot point at. I call it the AI swamp on purpose, because anyone who lived the data swamp gets it in one breath. But "swamp" undersells the new part, so the name this essay actually argues for is the swamp that feeds itself, the ouroboros that eats its own tail by design. And if you want the mechanism in two words: inference rot, where bad inference enters the loop and every future inference degrades. Accessible on the cover, sharp on the inside.
"AI is different. It gets better with scale."
This is the comfortable belief, and it is the one that builds the swamp. The argument goes: data swamps were a tooling problem, and AI tooling (registries, evals, observability, retrieval, better base models) is riding the same cost curve down. Scale plus instrumentation will tame the mess.
Here is the honest part, because a false dichotomy would be cheating you: scale genuinely did help. Bigger models are more capable, evals do catch real failures, good RAG does beat no RAG. None of that is fake, and I am not nostalgic for a pre-AI world. The capability is extraordinary.
But capability was never the thing that made a swamp. Ungoverned volume was. And here the analogy stops protecting us and starts warning us: in the data swamp, a human still had to author each bad table. There was a brake: a person, a keystroke, a unit of effort per unit of garbage. Generation removes the brake. The asymmetry is the whole game: generation is getting cheaper faster than verification. You cannot eval your way out of that when the evaluator is often the same generative process that made the artifact. An LLM grading an LLM against a swamp of retrieved documents is not governance; it is automating the rubber stamp. The question is not whether the tools are real. It is whether governance can stay upstream of generation. Right now, it lags.
The hidden truth — and it cuts both ways
Every system I have ever debugged had a hidden truth that was the real story all along, the part under the surface everyone forgot the moment the next wave arrived. The AI swamp has two of them, and they are mirror images of one fact: verification stopped being free. One is a warning. The other is the best opportunity of the decade. I refuse to give you only the warning.
It doesn't add noise. It removes the signals.
Everyone fixates on "more garbage, more hallucinations." That is the shallow read. The deeper damage is that the swamp dissolves the free signals we used to tell real from fake. Fluent writing used to be weak evidence that someone had thought. Working code was weak evidence that someone had engineered. A confident answer was weak evidence that someone had checked. Those were civilization's cheap shortcuts for trust, and AI just made every one of them counterfeitable. Plausibility became cheap enough to mimic provenance. Push the synthetic fraction high enough in a closed loop and the original human signal does not vanish so much as get overwhelmed, becoming progressively harder and more expensive to recover unless someone deliberately keeps clean ground truth aside. How fast that threshold arrives at real-world scale is still an open question; the direction of travel is not. And because a model optimizes for the likely, it quietly shears off the long tail (the weird, the genius, the outlier), which is so often where real innovation first appears. When everything can sound true, "sounds true" stops meaning anything at all.
The flood that drowns the careless makes the curator king.
Run the same collapse in reverse and it is the best news a serious builder has had in years. When creation is free, value does not vanish; it migrates to judgment. The moat flips from "I can make more" to "I can prove what is worth keeping." The swamp destroys the commodity expert (if an AI can do it, it was process work, not expert work) and pays a steep premium for the real kind: first-hand experience, taste, the ability to say "this one, not that one" and be right. The new scarcity is not intelligence. It is expensive sincerity: judgment with real skin in the game, with money or reputation or scarce compute actually at risk behind a claim. In every prior era we governed swamps by adding metadata: links, schemas, manifests. This time we add economic metadata, and that is an opportunity, not a tax. AI may do for trust what the web did for discovery: make it first-class infrastructure. The person the swamp makes priceless is the one you can trust to tell you what is real.
The human swamp: first the artifacts, then the résumés
Here is where the swamp stops being about files and starts being about us. The AI swamp corrupts artifacts first and credentials second, because the moment producing the appearance of competence costs nothing, the labor market loses its provenance exactly the way the data lake did. "I do AI" stops carrying information. This is not a sixth swamp; it is the human shadow the AI swamp casts the instant it forms.
Watch the signal collapse. A fifteen-year-old pasting prompts and a thirty-year architect designing a governed agent system both ship something that looks like "AI work," because the tool now manufactures the visible output. In the data era that could not happen: the result still exposed the operator's judgment. Data competence stayed legible because the output revealed the human; AI competence went dark because the output mostly reveals the model. Notice that "data is data": whether you ran it on AWS or Azure was incidental, because the discipline underneath (modeling, lineage, knowing what a wrong number costs at 3 a.m.) was the skill, and you could test for it in an hour. Today people mistake tool-access for that discipline. Holding a key to ChatGPT is not the skill, any more than holding a key to AWS ever was.
So the hidden truth has a human face, on both sides:
- The bad: the demo no longer proves the engineer. AI did not eliminate expertise. It eliminated the old evidence of it.
- The good: as counterfeit competence floods in, the premium for judgment that has survived real stakes goes up, not down. When everyone can generate a polished system, the one person who can sign their name to a guarantee is worth more, not less.
How to tell the pilot from the passenger
If the artifact no longer proves the person, what does? Three questions fluency cannot fake, because none of them asks for output; they ask for judgment under reality:
| The question | What it reveals |
|---|---|
| The architecture of rejection. "Show me the three designs you discarded, and why." | A passenger only knows what the machine offered. A practitioner carries a dense map of what they deliberately refused. |
| The failure horizon. "Where does the model become a liability in this stack, and where do you cut it off for deterministic code?" | The novice wants the LLM to do everything. The veteran knows the exact line where you stop trusting it. |
| The scars. "Tell me a decision where being wrong cost real money or reputation, and what human circuit caught it." | Prompting produces vocabulary. It cannot produce a credible postmortem with concrete trade-offs. |
And let me be careful here, because the lazy version of this section is "kids these days," and that version is both ugly and wrong. This is the opposite. The interface flattened visible effort, so the market lost its cheap way to tell a novice from an expert, and that is an indictment of the market, not of beginners. I am not the mayor of these swamps. I am someone who has had to wade out of four of them and can tell you the water is rising again. The most generous fact in this whole essay is this: the moat is open to anyone willing to earn the scars instead of laundering them through a chatbot. The AI swamp makes everyone look like a pilot, right up until the engine stops.
Why this isn't another "AI is hype" take
Let me be clear about what I am not saying, because the lazy version of this essay is a skeptic sneering at a technology he doesn't use. I use these models every day. They are the most powerful tools I have touched in a thirty-year career. This is not an argument about capability. It is an argument about operations and lineage, the unglamorous layer where I have always lived. The models are extraordinary. The swamp around them is the risk, and unlike the capability question, the swamp is a problem whose shape we already know, because we have drained four of them before.
The cure: old governance, plus one new thing
Half of the answer is boring and proven. Everything that drained the data lake ports straight up a layer, and most teams have not even done this much:
- Registries for models, prompts, and embeddings, the way we finally built catalogs for tables.
- Provenance and lineage on every generated artifact: what produced this, from what context, with which version.
- Evals as the new quality gate, the unit test of the inference world, run before an artifact is trusted, not after it has shipped.
- Ownership and lifecycle: no orphaned agents in production; every artifact has a human whose name is on it.
- AI contracts: the data-contract idea, extended into a declared, versioned interface for what a model or agent promises to consume and emit.
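As a minimal sketch of how the first four items might fit together, the record below attaches registry versions, lineage, an owner, and an eval result to one generated artifact, and a gate lists what is still missing before the artifact is trusted. Everything here is hypothetical: the `GeneratedArtifact` fields, the `trust_problems` function, and the identifiers are illustrative names, not an existing library or standard. Contracts are left out to keep it short.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GeneratedArtifact:
    """Hypothetical provenance record attached to one generated artifact."""

    artifact_id: str
    model_version: str                # registry entry: which model produced it
    prompt_version: str               # registry entry: which prompt produced it
    context_sources: Tuple[str, ...]  # lineage: what the model was shown
    owner: Optional[str]              # the human whose name is on it
    eval_passed: Optional[bool]       # None means the eval gate never ran


def trust_problems(artifact: GeneratedArtifact) -> List[str]:
    """Return the reasons an artifact should not be trusted yet (empty if none)."""
    problems = []
    if not artifact.owner:
        problems.append("no owner")
    if not artifact.context_sources:
        problems.append("no lineage")
    if artifact.eval_passed is None:
        problems.append("eval never ran")
    elif artifact.eval_passed is False:
        problems.append("eval failed")
    return problems


# An orphaned artifact: versions are recorded, but nothing else is.
orphan = GeneratedArtifact("summary-001", "model-v3", "prompt-v12", (), None, None)
print(trust_problems(orphan))  # ['no owner', 'no lineage', 'eval never ran']
```

The design choice worth noticing is that a missing eval is reported separately from a failed one, so "nobody checked" cannot hide behind "nothing failed."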
That is table stakes. But the generative property means table stakes are not enough, and pretending otherwise is how we lose. The genuinely new requirement is contamination control: preventing the recursive ingestion of unverified machine output. The primitive we are missing is negative provenance: not proving where a good artifact came from, but proving what a synthetic one is not, so machine output cannot silently pass as observed reality. Concretely, that means an attestation layer that is expensive to produce and hard to fake at generation time: human sign-off with real economic or reputational stake behind it, and clean ground-truth reserves kept deliberately uncontaminated to train and check against. Watermarking and provenance ratios are early, partial moves toward it. Porting governance is 2015 thinking, and we still need it. Surviving a swamp that feeds itself is the 2025 problem, and it asks for something we have never had to build before: a way to protect the concept of upstream.
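What contamination control could look like at an ingestion boundary is sketched below, under loud assumptions. The `Record` type, its `origin` labels, and the `admit_to_corpus` policy are all hypothetical, and the sketch simply trusts the labels it is given. Making those labels hard to fake is the unsolved part this essay is pointing at.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    text: str
    origin: str                        # "observed", "synthetic", or "unknown"
    attested_by: Optional[str] = None  # human who put their name behind it


def admit_to_corpus(records: Iterable[Record]) -> List[Record]:
    """Keep only records allowed to feed a downstream index or training run.

    Assumed policy: observed data is admitted. Synthetic data is admitted
    only with a human attestation. Unknown origin is treated like synthetic,
    so machine output cannot pass as reality by omission.
    """
    admitted = []
    for record in records:
        if record.origin == "observed" or record.attested_by:
            admitted.append(record)
    return admitted


incoming = [
    Record("order total from the source system", "observed"),
    Record("model-written summary", "synthetic"),
    Record("model-written summary, reviewed", "synthetic", attested_by="j.doe"),
    Record("scraped page, origin not recorded", "unknown"),
]
admitted_texts = [record.text for record in admit_to_corpus(incoming)]
print(admitted_texts)
```

The default matters more than the rule: a record with no recorded origin is refused, which is the opposite of the "store everything, govern it later" habit.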
Read the paper that called it a decade early
In 2015, a team at Google led by D. Sculley wrote nine of the most quietly prophetic pages in machine learning. They were not talking about LLMs (they couldn't have been), but every failure mode in this essay is in there: CACE ("changing anything changes everything"), glue code, pipeline jungles, undeclared consumers, configuration debt. It is the AI swamp, described before the flood. The paper is listed under the sources at the end of this piece.
Our open questions
- Can governance ever stay upstream of generation? Every previous swamp was drained after it formed. The AI swamp forms faster than we can wade in. Is "clean it later" finally, structurally, dead, or just harder?
- What does "negative provenance" actually look like as a system? How do you cryptographically or economically separate observed reality from synthetic inference at scale, without strangling the generation that creates the value?
- If expensive sincerity becomes the moat, who can afford it? Does trust-as-infrastructure concentrate power in whoever can stake the most, or does it finally pay the careful builder what the careless one used to capture?
Sources & further reading
- Paper: Sculley et al. (2015), "Hidden Technical Debt in Machine Learning Systems", NeurIPS (PDF) · NeurIPS page. CACE, glue code, pipeline jungles; it predicted the AI swamp a decade early.
- Paper: Shumailov et al. (2023), "The Curse of Recursion: Training on Generated Data Makes Models Forget" (PDF) · Nature (2024) version. The empirical proof that the swamp feeds itself.
- Lineage: James Dixon (2010), the original "data lake" coinage · the Gartner-era "data swamp" warnings (2014–2017) that named what the lake became.
- Concept: Model collapse · data mesh & data contracts, the governance answer that emerged for data and now ports up a layer.
Every era makes one layer of creation free, and the winners are the ones who make trust scale faster than output. The tools change every decade. The swamp, and the discipline it takes to drain one, do not. In an era where anyone can speak with the ghost of a million voices, the only thing that holds value is the one who can prove he is standing in the room.