Transformers / Transformers

Transformers

Before 2017, the strongest language models read text one token at a time, left to right. Each step carried a running memory forward, so they trained slowly and lost the thread of a long passage. The design that dropped that loop is the transformer, introduced in the 2017 paper "Attention Is All You Need".

Instead of stepping through a sequence, a transformer reads every position at once and uses attention to let any token pull directly from any other. Trading the step-by-step loop for one parallel pass is the difference between a for loop and a vectorized operation.

Two wins follow. Training spreads across a GPU, since positions no longer wait on each other. And any token can reach any other in a single step, so distant context survives instead of fading; a pronoun near the end of a paragraph still links to the noun that opened it.

One caveat keeps the picture honest. Reading all at once describes training and ingesting your prompt. When the model writes its reply, it still emits one token at a time, autoregressively.

This is the architecture under nearly every large language model today. The exceptions are state-space models like Mamba, which drop or interleave attention to dodge its cost.

The vocabulary was new; the winning move was subtraction. Drop the loop, keep the attention, and let scaling laws do the rest.