We fear the curse of dimensionality. In 1965, Cover proved it's the whole trick.

In short

Every data scientist learns to fear the curse of dimensionality on day one. But in 1965, Thomas Cover proved the other side of that coin: projecting tangled data into higher dimensions makes it far more likely to be linearly separable. Kernels are its direct heirs; embeddings and neural nets ride the same represent-then-separate intuition. We're taught to fear the very geometry our best tools run on, and we rarely teach whose theorem it was. Pair it with Vapnik's 1971 capacity bound and you also get the honest line between learning a pattern and memorizing noise.

What Cover actually proved

Take N points scattered in a d-dimensional space, in "general position" (no degenerate alignments). Ask a sharp question: of all the ways you could split those points into two groups, label them red or blue any way you like, how many of those splits can a single flat hyperplane separate?

Cover gave the exact count. For points in general position, the number of homogeneous (through-the-origin) linear splits is C(N, d) = 2 · Σ_k=0..d-1 (N-1 choose k). The number that matters falls out of it: when N = 2d, exactly half of all possible labelings are linearly separable. (With a bias term you effectively get one more degree of freedom, a detail worth keeping honest.) So a linear model in d dimensions has a separating capacity around 2d, and what makes it striking is the steepness: well below 2d, a random labeling is separable with probability near 1; well above it, that probability crashes toward 0. It's a phase transition, not a gentle slope.

Read that again, because it's the whole paper. Cover didn't ask the old question, "can we separate this particular dataset?" He asked a probabilistic, geometric one: "what fraction of all possible random labelings of points in general position are separable at all?" And the answer rises sharply as d grows. Add dimensions, and more of the universe of problems falls below the transition, solvable with a single straight cut. (A caveat he couldn't have framed in 1965: real data isn't random points in general position, it often lies near much lower-dimensional structure (the manifold hypothesis), so the practical blessing comes from lifting in structured ways, not sprinkling noise into high dimensions.)

The geometry, made visual

The intuition is almost unfair in how simple it is. Points that are hopelessly tangled in a low-dimensional space, a ring of one class wrapped around a blob of another, no line can cut them, become trivially separable once you lift them into a higher dimension, where there's suddenly room to slide a flat sheet between them.

Cover's idea: points a straight line can't split in low dimensions become separable once lifted into a richer space.

That lift is not a metaphor. It is exactly what a kernel does (implicitly), what a neural network's hidden layer does (learned), and what an embedding does (deliberately). When you map a token to a 768-dimensional vector, you are not being wasteful, you are buying room for Cover's separability to work. High dimension alone doesn't guarantee it (a bad representation wastes the room), but with a well-learned space, concepts that were knotted together in the raw input can be pulled apart by something as dumb as a hyperplane.

Implicit (kernels), learned (neural nets), or deliberate (embeddings), every modern method buys the separability Cover described.

Why it's different from everything before it

Before Cover, pattern recognition was a craft. You hand-engineered clever features until your data happened to separate, or you assumed it already did and moved on. The question "will this separate?" was answered case by case, by intuition and luck.

Cover reframed the entire field in two strokes. First, he made it geometric: separability is a property of how points sit in space, not of the cleverness of your features. Second, he made it probabilistic: instead of "can this dataset separate," he computed the probability that a random problem of a given size separates in a given dimension. That shift, from deterministic feature-craft to a probability distribution over geometry, is the bridge from pattern recognition as art to learning as a measurable property of dimension. Vapnik's statistical learning theory, the kernel trick, and the whole modern instinct to "just embed it in a big space" all stand on that bridge.

Question	Before Cover (pre-1965)	Cover (1965)
How do we separate classes?	Hand-craft features until it works	Add dimensions; the geometry does it for you
"Will this separate?"	Case-by-case, intuition	A computable probability, given N and d
What is a hard problem?	One we can't find features for	One where random labelings stop separating (past the 2d transition)
The role of dimensions	A nuisance (sparsity, cost)	The lever, more dims, more separability

The part the story gets wrong

We fear the curse. Cover proved the blessing. They're two faces of one geometry.

Day one of any data course, you're warned about the curse of dimensionality: in high dimensions data gets sparse, distances concentrate, everything becomes roughly equidistant. True, and it instills a lifelong instinct to fear dimensions and to reduce them.

But the curse and Cover's blessing are siblings, two consequences of the same high-dimensional geometry, not the same fact. Here's the link that ties them: in high dimensions almost all of a sphere's volume sits near its surface, almost all of a cube's near its corners. That emptiness is what makes points feel far apart and search unstable (the curse), and it's the very same room that lets a flat hyperplane almost always slip a gap between your classes (the blessing). The sparsity that makes searching hard is what makes separating easy.

And here is the uncomfortable read: modern AI leans hard on the blessing side. An embedding is a deliberate flight into high dimensions. The honest lineage is graded: kernels and SVMs are Cover's direct descendants; deep nets and embeddings ride the same represent-then-separate intuition, but they go beyond him, they learn the lift instead of fixing it. The cleanest way to say it: Cover's theorem is the exit strategy of a neural network, the network spends its layers warping the space so that, in the final layer, a plain linear head finds Cover's geometry already true. Cover provides the room; attention provides the focus.

The honest framing isn't "high dimensions are good" or "bad." It's that dimensionality is a dial with a sharp setting, and Cover gave us the law that governs it. Where you stop learning and start memorizing is a related but distinct boundary, and for that you need the theorem's sibling, six years later.

The sibling: where learning ends and memorizing begins

It's tempting to read Cover's 2d as a memorization limit, "fit 2d points, you've overfit." That's the seductive, slightly-wrong version, and it's worth getting right. Cover's 2d is a probabilistic transition for random labelings. The real worst-case "how many points can this classifier shatter no matter how they're labeled" number is the VC dimension, which Vapnik and Chervonenkis pinned down in 1971: for a linear separator in d dimensions it's only d+1. Cover tells you how the odds of separability fall off; VC tells you the hard ceiling on what you can always fit, and therefore where fitting stops being learning. They're the two halves of the same story: Cover explains why a richer representation can separate; VC explains why more capacity eventually overfits. An architect needs both numbers in their head.

Why an architect should care

This isn't history for history's sake. Cover's capacity is a working rule I use on live systems:

When your data won't separate, reach for a richer representation, not a cleverer model. A better embedding (more, well-chosen dimensions) often beats a fancier classifier, because separability is a property of the space, not the algorithm. That's Cover, directly.
Keep Cover's transition and VC's ceiling as two different alarms. When your representation is rich relative to your data (well below the separability transition), even noise separates, that's not skill, it's room. The hard "you can fit anything now" ceiling is VC's d+1, not Cover's 2d. Track both: one tells you separation is easy, the other tells you it's becoming meaningless.
Vector search and RAG ride both faces at once. Embedding documents into a high-dimensional space and trusting "near in space = near in meaning" leans on the blessing (separable structure), but retrieval is similarity search, where the curse (distance concentration, hubness, approximate-NN trade-offs) bites hardest. Knowing the geometry tells you why embeddings work and why your recall quietly degrades as dimensions climb.

Read the original paper

It's nine pages of clean, classical geometry, remarkably readable for a 1965 IEEE paper, and hosted on Cover's own Stanford page. Read it here without leaving:

PDF

Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition

Thomas M. Cover · IEEE Trans. Electronic Computers EC-14 · 1965

⤢ Expand Open paper ↗

✕ Close

Hosted on Stanford's Information Systems Lab (Cover's own page). On phones, or if the inline view is blocked by your browser, tap “Open paper ↗”.

Thomas M. Cover

1938–2012 · Professor of Electrical Engineering & Statistics, Stanford · IEEE Shannon Award (1990)

The person behind the paper

If you've taken a course on information theory, you've used his book. Cover wasn't a pattern-recognition specialist chasing a benchmark, he was a foundational thinker who kept finding the simple law under a hard problem, then moving on.

1965, this paper: the separating-capacity geometry, written while still early in his career.
1967, with Peter Hart, the nearest-neighbor theorem (its error is at most twice the optimal), another result every ML practitioner uses without naming.
1991, co-authored "Elements of Information Theory" with Joy Thomas, the standard graduate text for a generation.
1990, awarded the IEEE Claude E. Shannon Award, the field's highest honor.
also, universal portfolios, universal data compression, the Cover–Pombra channel, a career of finding the durable law beneath the noise.

Our open questions

If the blessing and the curse are the same geometry, why do we still teach only half of it? What would change in how people build if "Cover's blessing" were day-one material alongside the curse?
Cover's 2d marks where random labelings stop separating for linear classifiers, yet deep networks operate far past it and still generalize. Is that a contradiction, or just the same geometry playing out in a much larger effective dimension we don't yet know how to count?
Embeddings keep getting bigger. Is there a point where the curse (distance concentration) finally overwhelms the blessing (separability), and have today's retrieval systems quietly already hit it?

Sources & further reading

PaperCover (1965), "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition", IEEE Trans. Electronic Computers EC-14, 326–334 · Cover's theorem (overview)
AuthorThomas M. Cover · Stanford page
LineageVapnik & Chervonenkis (1971), capacity & generalization · Boser, Guyon & Vapnik (1992), the kernel-trick SVM · Cover & Hart (1967), nearest neighbor