KV Cache
The KV cache is memoization applied to attention. It's the single most important optimization in inference, and every serving system has one.
Here's the waste it removes. Inside each attention layer, every token produces two vectors, a key and a value, that later tokens read. Because the model only looks backward at earlier tokens, a token's key and value never change once computed. Without a cache, generating each new token would redo that work for the entire sequence so far, over and over.
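To make the reuse concrete, here is a minimal sketch of one attention head in plain Python. The weights `W_Q`, `W_K`, `W_V`, the token vectors, and the `KVCache` class are made-up values and names for illustration, not taken from any real model or library. It runs the same decoding loop twice, once recomputing every key and value and once appending to a cache, and checks that the outputs match.

```python
import math

# Hypothetical toy weights for a single attention head (dimension 2).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.4]]
W_V = [[1.0, 0.5], [0.3, 0.8]]

def project(x, w):
    """Multiply row vector x by matrix w (given as a list of rows)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Softmax-weighted average of values, scored by q against each key."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def step_without_cache(tokens):
    """Recompute every key and value, then attend from the last token."""
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return attend(project(tokens[-1], W_Q), keys, values), len(tokens)

class KVCache:
    """Keeps the keys and values of all earlier tokens."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        # Only the new token's key and value are computed.
        self.keys.append(project(token, W_K))
        self.values.append(project(token, W_V))
        return attend(project(token, W_Q), self.keys, self.values), 1

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = KVCache()
naive_work = cached_work = 0
for i in range(len(tokens)):
    out_naive, n = step_without_cache(tokens[: i + 1])
    out_cached, c = cache.step(tokens[i])
    naive_work += n
    cached_work += c
    assert all(math.isclose(a, b) for a, b in zip(out_naive, out_cached))

print(naive_work, cached_work)  # 10 4
```

A real model keeps one such cache per layer and per head and stores it in GPU tensors, but the bookkeeping is the same: append the newest key and value, reuse everything else.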
The savings are large. Generating 100 tokens from a 1,000-token prompt naively means redoing the key and value work for the whole growing sequence at every step: 1,000 tokens on the first step, 1,001 on the second, and so on, for roughly 105,000 token computations in total. With a cache, you compute the prompt once and add one token per step, about 1,100. That's about 95x less work.
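The arithmetic can be checked with a few lines of Python. The helper name `kv_computations` is made up for this sketch, and it counts only per-token key and value computations, as the paragraph does.

```python
def kv_computations(prompt_len, new_tokens):
    """Count per-token key/value computations without and with a cache."""
    # Without a cache, each step recomputes keys and values for every token
    # already in the sequence: prompt_len, prompt_len + 1, and so on.
    naive = sum(prompt_len + i for i in range(new_tokens))
    # With a cache, the prompt is processed once, then one token per step.
    cached = prompt_len + new_tokens
    return naive, cached

naive, cached = kv_computations(1000, 100)
print(naive, cached, round(naive / cached, 1))  # 104950 1100 95.4
```

The gap widens as generation gets longer, because the naive count grows with the square of the sequence length while the cached count grows linearly.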
But you've traded compute for memory, and the bill is real. The cache grows with every token of every conversation you're serving at once. A long chat can hold gigabytes, and whatever memory is left after the weights determines how many users you can serve.
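For a rough sense of scale, the cache size can be estimated as two vectors (key and value) per token, per layer, per head. The sketch below uses a hypothetical model shape (32 layers, 32 key/value heads of dimension 128, 16-bit values) and a hypothetical GPU budget (80 GiB total, 14 GiB of weights). These numbers are assumptions chosen for illustration, not measurements of any particular system.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size for one sequence of the given length."""
    # 2 = one key vector plus one value vector per token, per layer, per head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def concurrent_sequences(gpu_bytes, weight_bytes, bytes_per_sequence):
    """How many full-length caches fit in the memory left after the weights."""
    return (gpu_bytes - weight_bytes) // bytes_per_sequence

# Hypothetical model shape, assumed for illustration only.
shape = dict(layers=32, kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(1, **shape)
long_chat = kv_cache_bytes(8192, **shape)
print(per_token / 2**20)   # 0.5 (MiB per token)
print(long_chat / 2**30)   # 4.0 (GiB for an 8,192-token conversation)

# Hypothetical budget: 80 GiB GPU, 14 GiB of weights.
print(concurrent_sequences(80 * 2**30, 14 * 2**30, long_chat))  # 16
```

Under these assumptions, sixteen long conversations fill the card. This estimate ignores activations and other runtime overhead, so a real deployment would fit fewer.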
That tension, compute savings paid for in scarce GPU memory, is the central resource fight in serving. It's why later tricks like PagedAttention exist: to pack those caches efficiently and fit more of them.
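As a loose illustration of why packing matters, the sketch below compares two ways of reserving cache space: one maximum-length slab per conversation, versus small fixed-size blocks handed out as each conversation grows, which is the general idea behind paged schemes. The function names, sequence lengths, and block size are hypothetical, and this is not how any particular library implements it.

```python
import math

def reserved_contiguous(seq_lens, max_len):
    """Token slots reserved when every sequence pre-allocates max_len slots."""
    return len(seq_lens) * max_len

def reserved_paged(seq_lens, block_size):
    """Token slots reserved when sequences take fixed-size blocks on demand."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [120, 900, 35, 2048]  # hypothetical current conversation lengths

print(sum(seq_lens))                                 # 3103 slots actually used
print(reserved_contiguous(seq_lens, max_len=2048))   # 8192 slots reserved
print(reserved_paged(seq_lens, block_size=16))       # 3136 slots reserved
```

With block allocation, the only waste is the partly filled last block of each conversation, so the memory freed up can hold more conversations.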