We credit the Transformer. The blueprint was already drawn in 2013 — on handwriting.
An honest review of Alex Graves's overlooked 2013 paper 'Generating Sequences With Recurrent Neural Networks' — and what it reveals about which idea really changed AI.
Ask how we got to ChatGPT and you'll hear four beats: word embeddings (2013), sequence-to-sequence (2014), attention (2014), Transformer (2017). It's tidy. It's also a story told by the winners — and it skips work that was already circling the answer.
Let me be honest up front, because honesty is the point of a paper review: Graves did not invent "generate by predicting the next thing." Mikolov had RNN language models beating n-grams in 2010; Sutskever, Martens and Hinton were generating text character-by-character with RNNs in 2011; and the idea descends from Bengio's neural probabilistic language model in 2003. So why single out a 2013 paper about cursive?
What Graves actually added
Two things the others didn't have — and both are quietly load-bearing today.
First, he generated continuous data, not just discrete tokens. Text is a choice among symbols; handwriting is a real-valued trajectory, where the pen's next move is a pair of real numbers, not a letter. For handwriting, Graves's network didn't output a softmax over a vocabulary; it output the parameters of a probability distribution (a mixture of Gaussians) and sampled the next pen movement from it. That mixture-density idea is close kin to how many modern robotics and multimodal systems generate: they predict distributions over continuous actions, not tokens. It's the most overlooked gem in the paper.
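To make the mixture-density output described above concrete, here is a minimal sketch in plain Python. It is a simplification, not the paper's exact output layer: it assumes diagonal 2-D Gaussians and leaves out the correlation term and the end-of-stroke probability that the paper also models. The function name `sample_pen_offset` and the example numbers are mine, chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_pen_offset(weight_logits, means, log_stds, rng):
    """Draw one (dx, dy) offset from a mixture of diagonal 2-D Gaussians.

    weight_logits: K raw scores, turned into mixture weights by softmax
    means:         K (mu_x, mu_y) pairs
    log_stds:      K (log sigma_x, log sigma_y) pairs; exp keeps sigma positive
    """
    weights = softmax(weight_logits)
    # Step 1: pick which mixture component generates this time step.
    k = rng.choices(range(len(weights)), weights=weights)[0]
    # Step 2: sample a real-valued offset from that component.
    (mu_x, mu_y), (ls_x, ls_y) = means[k], log_stds[k]
    return rng.gauss(mu_x, math.exp(ls_x)), rng.gauss(mu_y, math.exp(ls_y))


rng = random.Random(0)
# Hypothetical network outputs for a single time step (K = 2 components).
offset = sample_pen_offset(
    weight_logits=[2.0, 0.0],
    means=[(1.0, 0.0), (-1.0, 0.5)],
    log_stds=[(-1.0, -1.0), (-2.0, -2.0)],
    rng=rng,
)
print(offset)
```

The point of the design is that the network never commits to one answer: it emits a distribution with several plausible modes, and the sampler picks one, which is what lets the same prefix continue in more than one way.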
Second, he made the network decide where to look. A model that scribbles pretty nonsense is a toy. To make it write a specified word, the network had to know, at each stroke, which letter it was currently drawing.
From a sliding window to attention
His solution was a small, differentiable mechanism: a soft "window" (a mixture of Gaussians) that slid along the input text as the pen moved, so the network learned which characters to focus on while drawing. We'd call that attention now. But be precise: Graves's window was monotonic. It could advance or all but pause, never jump back or attend globally. A year later, Bahdanau, Cho and Bengio generalized it into content-based attention that can look anywhere; the Transformer then made that the entire architecture. So the honest framing isn't "Graves invented attention." It's "Graves built an early, working differentiable alignment mechanism, narrow and task-specific, and Bahdanau's content-based attention became the turning point the field actually built on."
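The sliding window is easier to see in code than in prose. The sketch below follows my reading of the paper's formulation: each window component has a height `alpha`, a sharpness `beta`, and a centre `kappa`, and the centre is updated by adding an exponentiated network output, which is what makes it monotonic. The one-component setup, the fixed `alpha` and `beta`, and the raw outputs in the loop are illustrative assumptions, not values from the paper.

```python
import math


def window_weights(alpha, beta, kappa, text_len):
    """phi[u] = sum_k alpha[k] * exp(-beta[k] * (kappa[k] - u) ** 2)."""
    return [
        sum(a * math.exp(-b * (k - u) ** 2) for a, b, k in zip(alpha, beta, kappa))
        for u in range(text_len)
    ]


def advance(kappa_prev, kappa_hat):
    """Move each window centre forward.

    exp() is always positive, so the centre can only advance. A very negative
    raw output gives a near-zero step, which is the window nearly pausing.
    """
    return [kp + math.exp(kh) for kp, kh in zip(kappa_prev, kappa_hat)]


text = "hello"
kappa = [0.0]  # the window starts at the first character
for kappa_hat in [-3.0, 0.0, 0.0]:  # hypothetical raw network outputs
    kappa = advance(kappa, [kappa_hat])
    phi = window_weights(alpha=[1.0], beta=[2.0], kappa=kappa, text_len=len(text))
    focus = max(range(len(text)), key=phi.__getitem__)
    print(f"kappa={kappa[0]:.2f} focus={text[focus]!r}")
```

Nothing in `advance` can produce a negative step, so there is no way for this window to revisit an earlier character. Content-based attention removes that restriction by scoring every input position directly.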
The ideas that endure usually show up early, in a narrow, unglamorous form — and get credited to whoever generalized them later.
What made it different from everything before it
Put the lineage side by side and the distinction is sharp. Every notable neural sequence generator before 2013 produced discrete symbols, on shallow networks, unconditionally. Graves broke all three limits at once, and the third (conditional, aligned generation) points straight at how we steer LLMs today.
| Paper | Data | Model | Conditional? | What it nailed |
|---|---|---|---|---|
| Bengio 2003 | discrete | feed-forward NN | no | the theory: model sequence probability with a neural net |
| Mikolov 2010 | discrete | vanilla RNN | no | RNN-LM beats n-grams |
| Sutskever 2011 | discrete | multiplicative RNN | no | generation-as-a-loop, char-level |
| Graves 2013 | discrete + continuous | deep LSTM + MDN | YES — aligned | continuous, conditional, style-controlled synthesis |
So the difference isn't "he did it first." It's what he could do that the others structurally could not: generate a real-valued trajectory (not just pick symbols), over long range (deep LSTM, not a shallow RNN fighting vanishing gradients), conditioned on a given input, and even biased toward legibility or primed with a writer's style at sampling time. Unconditional discrete generation was in the air by 2011. Controllable, conditional, continuous generation, the thing we take for granted every time we steer a model with a prompt, arrived here.
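The sampling-time bias mentioned above is a small piece of arithmetic, sketched here as I understand the paper's scheme: a bias `b >= 0` shrinks every standard deviation and sharpens the mixture weights, so samples cluster around the most likely strokes. The function name `apply_bias` and the example values are hypothetical.

```python
import math


def apply_bias(weight_logits, log_stds, bias):
    """Return (mixture weights, standard deviations) after applying bias b >= 0.

    weights: softmax of logits scaled by (1 + b), so larger b favours the top component
    stds:    exp(log_std - b), so larger b narrows every Gaussian
    """
    scaled = [w * (1.0 + bias) for w in weight_logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    stds = [math.exp(ls - bias) for ls in log_stds]
    return weights, stds


for b in (0.0, 1.0, 5.0):
    weights, stds = apply_bias([1.0, 0.0], [0.0, 0.0], b)
    print(b, [round(w, 3) for w in weights], [round(s, 3) for s in stds])
```

With `b = 0` the distribution is untouched; as `b` grows, sampling approaches always taking the single most probable stroke. It plays a role similar to lowering the temperature when sampling from an LLM.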
It wasn't attention that changed everything — it was escaping the sequential wall.
One scope note first, so this lands as analysis and not a hot take: I'm talking about sequence modeling — language, translation, the lineage that became LLMs. Computer vision had already had its ImageNet moment in 2012; that's a different story.
Here's the uncomfortable read. Within that lineage, we hand the credit for the modern era to attention — because that's the word in the famous title. But Graves already had a working alignment mechanism by 2013, on an LSTM, and the lineage did not turn over. Why not? It was a narrow handwriting task, the alignment was monotonic, training and inference were still strictly sequential, and the ecosystem (datasets, benchmarks, GPUs) hadn't caught up. So attention alone wasn't the unlock.
What the Transformer actually did was kill the sequential dependency. An LSTM processes a sequence one step after another: step 1,000 can't begin until step 999 finishes. That's a wall you cannot scale through, no matter how good your idea is. The Transformer dropped recurrence so that, during training, every position could be computed at once, turning the model into something a GPU could devour in parallel. The breakthrough was a hardware-software co-design, not a new cognitive trick.
And here's the precise version, because the distinction matters: self-attention is what made that parallelism possible. You can't compute every position at once unless you have a mechanism that relates all of them in a single shot — attention's constant path length is exactly that, where recurrence forces a chain that grows with the sequence. So the honest claim isn't "attention didn't matter." It's that attention mattered as the enabler of parallel hardware, not as the cognitive "looking" trick we usually celebrate it for. Graves had the looking in 2013; what his LSTM could never have was the parallelism — because it still walked the sequence one step at a time.
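The structural difference shows up even in a toy. The sketch below uses scalars instead of vectors, no learned projections (query, key and value are all just the input), and no causal mask, so it illustrates the dependency shape only and is not a real layer. The names `rnn_states` and `attend` are mine.

```python
import math


def rnn_states(xs, w=0.5, u=0.9):
    """Toy scalar recurrence: h[t] needs h[t-1], so the loop cannot be reordered."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states


def attend(i, xs):
    """Toy scalar self-attention output for position i.

    Reads only the inputs xs, never another position's output.
    """
    scores = [xs[i] * xj for xj in xs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e / z * xj for e, xj in zip(exps, xs))


xs = [0.1, -0.4, 0.7, 0.2]
print(rnn_states(xs))

in_order = [attend(i, xs) for i in range(len(xs))]
# Evaluate the positions back to front, then restore the original ordering.
reverse = [attend(i, xs) for i in reversed(range(len(xs)))][::-1]
print(in_order == reverse)  # True: evaluation order is irrelevant
```

Because each call to `attend` depends only on the inputs, all positions can be handed to separate GPU threads at once. The recurrence has no such freedom: every state sits one step further down a chain that grows with the sequence.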
To be fair to the record: parallelism wasn't unique to the Transformer either. Convolutional sequence models (Gehring et al., 2017) were already posting big training-speed gains the same year. What self-attention added was the combination that won — global, content-based interaction and a fully parallel training shape in one mechanism. So the tightest version of the claim is this: parallelizable sequence modeling was the scaling unlock, and self-attention was the mechanism that made that unlock dominant.
So Graves 2013 isn't the dawn of the Transformer era. It's the pinnacle of the sequential one — the most that predict-one-step-at-a-time could do before the field realized the real bottleneck was the word "sequential," not the word "attention."
Why an architect should care
I don't review old papers for nostalgia. I review them because they teach a discipline I use on live systems: tell the durable idea apart from its implementation — and tell the idea apart from the constraint that's actually holding you back.
The durable ideas (predict-the-next, learned alignment) were already here in 2013, and they'll be here in 2030. The implementation churned (LSTM → Transformer → whatever's next). And the thing that moved the field wasn't the most-celebrated idea — it was an unglamorous structural fact about parallelism and hardware. When you design systems that have to last, that's the muscle to build: don't fall in love with the famous idea, and don't mistake your current technique for your actual bottleneck.
Made concrete, that's three calls I actually make on real platforms:
- Real-time vs. analytical. The Transformer's parallelism is a training-and-batch superpower. At inference, generation is still autoregressive — one token at a time — which is exactly why latency and cost dominate production LLM systems. For genuinely streaming, strictly-ordered, low-latency work, the monotonic, online posture of Graves's design (and its modern heirs — RNNs, and state-space models like Mamba) is sometimes the better fit, not a relic. Match the compute shape to the workload, not to the headline.
- Bottleneck, not buzzword. The same trap shows up in vendor and architecture reviews: teams adopt the celebrated component and leave the real constraint — a sequential dependency, a sync barrier, a data-movement wall — untouched. Name the bottleneck first; the famous feature rarely is it.
- Durable interface, swappable engine. "Predict-the-next over a learned representation" outlived its 2013 engine. Build your data and AI platforms so the interface (sequence in, distribution out) is stable and the model underneath is replaceable — because it will be.
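The third call can be sketched as a small contract. Everything here is hypothetical and for illustration: the `NextStepModel` protocol, the `BigramCounts` stand-in engine, and the `<end>` marker are my own names, not an API from any library. The loop in `generate` is also the one-step-at-a-time inference shape from the first bullet.

```python
from __future__ import annotations

import random
from typing import Mapping, Protocol, Sequence


class NextStepModel(Protocol):
    """Hypothetical interface: sequence in, distribution over the next item out."""

    def next_distribution(self, prefix: Sequence[str]) -> Mapping[str, float]: ...


class BigramCounts:
    """Stand-in engine; an LSTM or Transformer wrapper could expose the same method."""

    def __init__(self, corpus: Sequence[str]) -> None:
        self.counts: dict[str, dict[str, int]] = {}
        for prev, nxt in zip(corpus, corpus[1:]):
            followers = self.counts.setdefault(prev, {})
            followers[nxt] = followers.get(nxt, 0) + 1

    def next_distribution(self, prefix: Sequence[str]) -> Mapping[str, float]:
        followers = self.counts.get(prefix[-1], {}) if prefix else {}
        total = sum(followers.values())
        if total == 0:
            return {"<end>": 1.0}  # nothing observed after this token
        return {tok: n / total for tok, n in followers.items()}


def generate(
    model: NextStepModel, prefix: Sequence[str], max_steps: int, rng: random.Random
) -> list[str]:
    """Autoregressive loop: depends only on the interface, not on the engine."""
    out = list(prefix)
    for _ in range(max_steps):
        dist = model.next_distribution(out)
        tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        if tok == "<end>":
            break
        out.append(tok)
    return out


engine = BigramCounts("the pen moves and the pen stops".split())
print(generate(engine, ["the"], max_steps=5, rng=random.Random(0)))
```

Swapping the engine means writing another class with a `next_distribution` method; `generate` and everything built on top of it stay untouched.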
Our open questions
- If the loop was already known by 2011, why did the field need a handwriting paper to make these ideas legible — and which equally-overlooked paper are we walking past right now?
- We credited "attention" for a revolution that parallelism actually delivered. What does that say about how we'll mis-credit the next breakthrough — and how would you tell the difference in the moment?
- Graves's mixture-density output — generating continuous values, not tokens — is having a second life in robotics and multimodal models. Is "next-token" itself a limitation we'll one day look back on the way we now look back on recurrence?
Read the original paper
Twenty minutes well spent, especially the generated handwriting samples, which still feel a little uncanny.
Sources & further reading
- Paper: Graves (2013), "Generating Sequences With Recurrent Neural Networks", arXiv:1308.0850
- Author: Alex Graves
- Lineage: Bahdanau, Cho & Bengio (2014), content-based attention · Sutskever, Vinyals & Le (2014), seq2seq · Gehring et al. (2017), convolutional seq2seq (parallel, pre-Transformer) · Vaswani et al. (2017), "Attention Is All You Need" · Sutskever, Martens & Hinton (2011) · Mikolov et al. (2010) · Bengio et al. (2003)