Autoregressive Generation
A model predicts one token at a time. So how do you get a paragraph out of a function that returns a single token? You call it in a loop. Predict a token, add it to the input, predict again. This is autoregressive generation.
The word breaks down nicely: auto (self) and regressive (predicting from earlier values, as in statistical regression). Each token is generated by referring back to everything in the sequence so far: the prompt plus the model's own earlier outputs.
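The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `toy_next_token` is a hypothetical stand-in that returns one token ID from arbitrary arithmetic, and `generate`, `max_new_tokens`, and `eos` are names chosen here for the example.

```python
from typing import Callable, List, Optional


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model: maps the sequence so far to one token ID."""
    return (sum(tokens) * 3 + len(tokens)) % 10


def generate(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    eos: Optional[int] = None,
) -> List[int]:
    """Call the model in a loop, feeding each new token back in as input."""
    tokens = list(prompt)  # copy so the caller's prompt is not modified
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # predict one token from everything so far
        tokens.append(tok)        # add it to the input
        if eos is not None and tok == eos:
            break                 # stop early at an end-of-sequence token
    return tokens
```

Calling `generate(toy_next_token, [1, 2, 3], 4)` returns the three prompt tokens followed by four generated ones. Each generated token is computed from the full list that precedes it, which is why the iterations cannot be reordered.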
This loop has one property that drives nearly all of inference: it's sequential. Token five can't be computed until token four exists, because token five depends on it. Within a single sequence, you can't parallelize the output. No matter how many cores your GPU has, the tokens come out in order.
Reading the input is a different story. Every token in your prompt already exists, so the model can process all of them at once. That split, parallel reading versus sequential writing, is exactly what the prefill and decode phases are: prefill reads the whole prompt in one pass, and decode writes the output one token at a time.
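To make the split concrete, here is a toy sketch that counts forward passes. Everything in it is an assumption for illustration: `ToyModel` is a hypothetical class, its arithmetic is arbitrary, and its `cache` is a plain list standing in for the state a real system keeps about positions it has already processed.

```python
from typing import List, Tuple


class ToyModel:
    """Hypothetical model that only counts forward passes; not a real network."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def forward(self, new_tokens: List[int], cache: List[int]) -> Tuple[int, List[int]]:
        """One pass over `new_tokens`, reusing `cache` for earlier positions."""
        self.forward_passes += 1
        cache = cache + new_tokens  # remember every position seen so far
        next_tok = (sum(cache) * 3 + len(cache)) % 10  # arbitrary toy prediction
        return next_tok, cache


def generate(model: ToyModel, prompt: List[int], max_new_tokens: int) -> List[int]:
    if max_new_tokens <= 0:
        return []
    # Prefill: every prompt token already exists, so one pass handles them all.
    tok, cache = model.forward(prompt, [])
    out = [tok]
    # Decode: each pass takes only the newest token and yields the next one.
    while len(out) < max_new_tokens:
        tok, cache = model.forward([out[-1]], cache)
        out.append(tok)
    return out
```

In this sketch, a 3-token prompt and a 1,000-token prompt both cost one prefill pass, while the number of decode passes grows with the number of tokens generated.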
The sequential constraint is also what makes speculative decoding feel like magic: it finds a loophole that lets the model verify several guessed tokens in parallel, even though it can't generate them that way.
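The loophole can be sketched with two toy functions. This shows only the greedy variant of the idea, where a guess is accepted if it matches what the large model would have picked; sampling-based acceptance rules are not covered. `target_model`, `draft_model`, and `speculative_step` are hypothetical names, and the draft is rigged to be wrong at some positions so that rejection is visible.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]


def target_model(tokens: List[int]) -> int:
    """Hypothetical stand-in for the large, slow model."""
    return (sum(tokens) * 3 + len(tokens)) % 10


def draft_model(tokens: List[int]) -> int:
    """Hypothetical cheap guesser: wrong whenever the length is a multiple of 5."""
    guess = target_model(tokens)
    return guess if len(tokens) % 5 else (guess + 1) % 10


def speculative_step(target: Model, draft: Model, tokens: List[int], k: int) -> List[int]:
    """Return the tokens confirmed by one round of guess-then-verify."""
    # 1. The draft guesses k tokens, one after another (sequential but cheap).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(tokens + proposal))
    # 2. The target checks every position. All inputs are known up front, so
    #    these calls are independent; a real system would batch them in one pass.
    checks = [target(tokens + proposal[:i]) for i in range(k + 1)]
    # 3. Keep guesses until the first mismatch, then take the target's own token.
    accepted: List[int] = []
    for guess, truth in zip(proposal, checks):
        if guess != truth:
            break
        accepted.append(guess)
    accepted.append(checks[len(accepted)])
    return accepted


def speculative_generate(target: Model, draft: Model, prompt: List[int],
                         max_new_tokens: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens += speculative_step(target, draft, tokens, k)
    return tokens[: len(prompt) + max_new_tokens]
```

Because a guess is kept only when it equals the target's own choice, the output is the same as running the target alone in a plain loop. The gain is that one verification round can confirm several tokens at once when the guesses are good.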