Speculative Decoding
Decode is sequential, but there's a loophole, and it makes for the strangest optimization in inference.
Remember why prefill is fast: when tokens already exist, the model checks them all in parallel. Generating is sequential, but verifying is parallel. Speculative decoding exploits that gap.
It works like this. A small, fast draft model races ahead and guesses the next few tokens. Then the big model verifies the whole run in a single forward pass. Tokens that match what the big model would have said are kept. At the first miss, you take the big model's token and start guessing again from there.
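To make the loop concrete, here is a minimal sketch of one draft-then-verify round under greedy decoding. Everything in it is an assumption for illustration: the `NextToken` interface, the names `speculative_step` and `generate`, and the two toy models, which stand in for real networks. A real system would score all drafted positions in one batched forward pass; the sketch loops over them so the logic stays readable.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    """One draft-then-verify round (greedy). Returns the newly accepted tokens."""
    # 1. The draft model guesses k tokens, one after another (cheap).
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(tokens + guesses))

    # 2. The target model checks each guessed position.
    #    (Conceptually one parallel pass; written as a loop here for clarity.)
    accepted: List[int] = []
    for i in range(k):
        want = target(tokens + guesses[:i])
        accepted.append(want)
        if want != guesses[i]:
            # First miss: keep the target's token and discard the remaining guesses.
            return accepted

    # 3. Every guess matched, so the same pass also yields one extra token.
    accepted.append(target(tokens + guesses))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Repeat rounds until n_new tokens have been produced."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        tokens += speculative_step(target, draft, tokens, k)
    return tokens[:len(prompt) + n_new]


# Toy models (assumptions): the target counts upward modulo 10,
# and the draft agrees except that it guesses wrong right after a 4.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10
```

Each round produces at least one token, because a miss still yields the big model's own choice at that position. With these toy models, `generate(toy_target, toy_draft, [0], 12, k=3)` returns the same sequence as running `toy_target` alone, just in fewer rounds.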
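The surprising result: with the right acceptance rule, the output is mathematically equivalent to the big model decoding alone. Under greedy decoding you get the same tokens; under sampling you get the same probability distribution over outputs. The draft model changes the speed, never the answer.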
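When the big model samples instead of picking the top token, "match" becomes a probabilistic test. The sketch below shows the commonly used rule for a single position: accept the drafted token with probability min(1, p/q), and on rejection resample from the leftover probability mass. The function name `accept_or_resample` and the plain-list distributions are assumptions for illustration, not a particular library's API.

```python
import random
from typing import List


def accept_or_resample(p: List[float], q: List[float],
                       x: int, rng: random.Random) -> int:
    """Decide one position.

    p: the target (big) model's distribution over the vocabulary
    q: the draft model's distribution over the vocabulary
    x: the token the draft model sampled from q (so q[x] > 0)
    """
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the residual max(0, p - q).
    # rng.choices normalizes the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]
```

The returned token is distributed according to `p` no matter what `q` looks like. A poor draft only lowers how often guesses are accepted, which is why the draft model affects speed and not the result.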
In practice this typically lands around 2 to 3x faster decode, depending on how often the draft's guesses are accepted, and it works best on predictable text. Code, boilerplate, and structured output are easy to guess, so the draft model nails them often.
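For a rough feel of where a number like that comes from, here is a back-of-the-envelope estimate. It rests on simplifying assumptions that real workloads only approximate: each drafted token is accepted independently with the same probability `alpha`, and one draft step costs a fixed fraction `draft_cost` of one big-model pass. The function names and parameters are hypothetical.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced per big-model pass when drafting k tokens.

    Assumes each guess is accepted independently with probability alpha.
    This is the geometric sum 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by the round's cost in big-model passes.

    One round costs one big-model pass plus k draft steps,
    each costing draft_cost of a big-model pass.
    """
    return expected_tokens_per_round(alpha, k) / (1 + k * draft_cost)
```

With illustrative (not measured) values of `alpha=0.8`, `k=4`, and `draft_cost=0.05`, the estimate comes out to about 2.8x. Drop `alpha` to 0.5 and it falls to roughly 1.6x, which is the arithmetic behind predictable text benefiting most.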
You've shipped this pattern before. Optimistic UI updates and CPU branch prediction do the same thing: do the probable work cheaply now, pay for one authoritative check, roll back on a miss.