
Prefill & Decode

Every request to a model splits into two phases that behave very differently. You can read a page all at once, but you write it one word at a time. Models are the same.

Prefill processes your prompt. Every prompt token already exists, so the model reads them all in one parallel pass. It's a burst of huge matrix multiplications, and it's compute-bound: the GPU's arithmetic throughput is the limit. Prefill ends by producing the first output token, so its duration is the pause you feel before anything appears. The metric is time to first token (TTFT).

Decode generates everything after that, one token per step, with each new token fed back in as input (autoregressive generation). Each step does little math but must read all of the model's weights from memory, so it's memory-bound: the limit is how fast the GPU moves bytes, not how fast it computes. The metric is time per output token (TPOT), the gap between streamed tokens.
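To make the two metrics concrete, here is a minimal sketch of how they could be computed from a streamed response. It assumes you have recorded the time the request was sent and the arrival time of each output token; the function name `ttft_and_tpot` and the timestamps are hypothetical, chosen only for illustration.

```python
from typing import Sequence, Tuple


def ttft_and_tpot(request_time: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, mean TPOT) in the same time unit as the inputs.

    request_time: when the request was sent.
    token_times:  arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")

    # TTFT covers prefill: everything up to the first output token.
    ttft = token_times[0] - request_time

    # TPOT covers decode: the average gap between consecutive tokens.
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical arrival times in seconds, not measurements from a real system.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT = {ttft:.2f} s, TPOT = {tpot:.2f} s")
```

Averaging the gaps is one reasonable choice; a percentile over the individual gaps may be more useful when you care about stutter rather than overall speed.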

If you remember one thing: prefill is compute-bound, decode is memory-bound. The two phases want different things from the same hardware, which is why large operators sometimes run them on separate GPU pools.
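A back-of-the-envelope way to see why the two phases land on opposite sides is to compare arithmetic intensity (FLOPs performed per byte of weights read) against what the hardware can sustain. The sketch below is a simplification under stated assumptions: roughly 2 FLOPs per parameter per token, 2-byte weights, a single request with no batching, and attention cost ignored. The hardware figures are hypothetical round numbers, not the specification of any particular GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that the weights are read once
    per pass, so the parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
          bytes_per_param: float = 2.0) -> str:
    """Classify one pass as compute-bound or memory-bound."""
    # Ridge point: the intensity at which compute and memory take equal time.
    ridge = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak, 2e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 2.0e12

print("prefill, 4096-token prompt:", bound(4096, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode, 1 token per step: ", bound(1, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions a prefill pass reuses each weight byte thousands of times, while a decode step uses it about once, which is the gap the rule of thumb describes.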

The asymmetry explains a lot. Long prompts stress prefill. Long outputs stress decode. Knowing which one you're hitting tells you which optimization to reach for.
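As a rough illustration of telling which phase a workload is hitting, the sketch below estimates the time spent in each one. It treats every forward pass as taking whichever is longer, the compute time or the time to read the weights. The model size, hardware numbers, and function names are all hypothetical, and the estimate ignores attention cost, batching, and software overhead, so read the results as orders of magnitude only.

```python
def pass_time(tokens: int, params: float, peak_flops: float,
              mem_bandwidth: float, bytes_per_param: float = 2.0) -> float:
    """Estimated seconds for one forward pass over `tokens` tokens."""
    compute_s = 2.0 * params * tokens / peak_flops      # ~2 FLOPs per param per token
    memory_s = params * bytes_per_param / mem_bandwidth  # read every weight once
    return max(compute_s, memory_s)


def phase_times(prompt_tokens: int, output_tokens: int, params: float,
                peak_flops: float, mem_bandwidth: float) -> tuple:
    """Return (prefill_seconds, decode_seconds) for one request."""
    prefill_s = pass_time(prompt_tokens, params, peak_flops, mem_bandwidth)
    # Prefill already produced the first output token; decode produces the rest.
    decode_steps = max(output_tokens - 1, 0)
    decode_s = decode_steps * pass_time(1, params, peak_flops, mem_bandwidth)
    return prefill_s, decode_s


def dominant_phase(prompt_tokens: int, output_tokens: int, params: float,
                   peak_flops: float, mem_bandwidth: float) -> str:
    prefill_s, decode_s = phase_times(prompt_tokens, output_tokens, params,
                                      peak_flops, mem_bandwidth)
    return "prefill" if prefill_s >= decode_s else "decode"


# Hypothetical setup: 8e9-parameter model, 1e15 FLOP/s, 2e12 bytes/s.
PARAMS, PEAK_FLOPS, MEM_BANDWIDTH = 8.0e9, 1.0e15, 2.0e12

# Long prompt, short answer versus short prompt, long answer.
print(dominant_phase(100_000, 20, PARAMS, PEAK_FLOPS, MEM_BANDWIDTH))
print(dominant_phase(200, 2_000, PARAMS, PEAK_FLOPS, MEM_BANDWIDTH))
```

In this model the first workload is dominated by prefill and the second by decode, matching the intuition that long prompts stress one phase and long outputs the other.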