Batching

Everything else in inference speeds up one request. Batching is how you serve thousands at once.

Recall that decode is memory-bound: to produce one token, the GPU reads every weight in the model. Here's the leverage. That read costs the same whether one conversation is waiting or 32 are. Stream the weights through the cores once and you can advance all 32 by a token each. Same weight traffic, many times the output. It's classic amortization, the same instinct as connection pooling or bulk inserts.
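To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a decode step costs exactly one pass over the weights and nothing else, which ignores per-request memory traffic and the point where compute becomes the limit. The 14 GB weight size and 1 TB/s bandwidth are hypothetical round figures, not measurements of any real model or GPU.

```python
def decode_step_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time for one decode step if it is limited purely by streaming the weights once."""
    return weight_bytes / bandwidth_bytes_per_s


def tokens_per_second(batch_size: int, weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Each step yields one token per request in the batch, at the same step cost."""
    return batch_size / decode_step_seconds(weight_bytes, bandwidth_bytes_per_s)


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
WEIGHT_BYTES = 14e9   # e.g. 7 billion parameters at 2 bytes each
BANDWIDTH = 1e12      # 1 TB/s of memory bandwidth

for batch_size in (1, 8, 32):
    rate = tokens_per_second(batch_size, WEIGHT_BYTES, BANDWIDTH)
    print(f"batch={batch_size:>2}  ~{rate:,.0f} tokens/s total")
```

Under these assumptions the step time never changes, so total throughput scales linearly with batch size: 32 conversations produce 32 times the tokens of one. Real systems flatten out well before "infinite batch", but the shape of the win is the same.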

The interesting part is scheduling. Early systems used static batching: collect a group, run them together, return when all finish. The problem is that the slowest request holds the rest hostage. A short reply waits on a long essay in the same batch.

Continuous batching schedules at the token level instead. Every step, finished requests leave and queued ones take their slots, like a restaurant seating new parties as tables open. The GPU stays full and each response returns the moment it's done.
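A toy simulation makes the difference concrete. This is a minimal sketch, not how any particular serving engine is implemented: it assumes every request is already queued, each step produces one token per active request, and prefill is free. The function names and the example lengths are made up for illustration.

```python
from collections import deque


def static_batching(lengths: list[int], slots: int) -> list[int]:
    """Finish step per request when a batch returns only after its longest member is done."""
    finish = []
    clock = 0
    for start in range(0, len(lengths), slots):
        batch = lengths[start:start + slots]
        clock += max(batch)                   # the whole batch waits for the slowest request
        finish.extend([clock] * len(batch))
    return finish


def continuous_batching(lengths: list[int], slots: int) -> list[int]:
    """Finish step per request when freed slots are refilled before every step."""
    queue = deque(enumerate(lengths))
    remaining: dict[int, int] = {}            # request index -> tokens still to generate
    finish = [0] * len(lengths)
    clock = 0
    while queue or remaining:
        # Seat queued requests in any open slots before running the step.
        while queue and len(remaining) < slots:
            idx, tokens = queue.popleft()
            remaining[idx] = tokens
        clock += 1                            # one decode step for every active request
        for idx in list(remaining):
            remaining[idx] -= 1
            if remaining[idx] <= 0:           # done: return it now and free the slot
                finish[idx] = clock
                del remaining[idx]
    return finish


# Hypothetical workload: output lengths in tokens, two slots on the GPU.
lengths = [2, 10, 3, 2]
print("static:    ", static_batching(lengths, slots=2))
print("continuous:", continuous_batching(lengths, slots=2))
```

With static batching, the 2-token reply sits until step 10 because it shares a batch with the 10-token one, and the second pair cannot even start until then. With continuous batching the short reply leaves at step 2, the next request takes its slot immediately, and the whole workload is finished by step 10 instead of step 13.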

One wrinkle: a new request needs its prefill first, a compute-heavy burst that briefly stalls everyone else's decode. You may have felt this as a mid-stream hitch in a chat app while the provider absorbs someone's giant prompt.
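Here is what that hitch looks like from the point of view of a request that is already streaming. It is a deliberately crude sketch: it assumes the newcomer's entire prefill runs in one piece ahead of a single decode step, and the 20 ms and 400 ms figures are invented for illustration, not measured.

```python
def inter_token_gaps_ms(num_steps: int, decode_ms: float,
                        prefill_at_step: int, prefill_ms: float) -> list[float]:
    """Time between consecutive tokens for a request that is mid-stream."""
    gaps = []
    for step in range(num_steps):
        gap = decode_ms
        if step == prefill_at_step:
            gap += prefill_ms    # the new request's prefill delays this decode step
        gaps.append(gap)
    return gaps


# Hypothetical timings: 20 ms per decode step, a 400 ms prefill admitted at step 3.
gaps = inter_token_gaps_ms(num_steps=6, decode_ms=20.0, prefill_at_step=3, prefill_ms=400.0)
print(gaps)
```

Five of the six tokens arrive on the usual cadence. One arrives after a pause twenty times longer, and that single gap is the stutter a user notices.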