Inference

Inference is using a trained model. Training is how a model learns; inference is what happens every time you send it a prompt and words stream back.

You feel its two halves directly. There's a pause while the model reads your prompt, then text arrives a few words at a time, like someone typing. Understanding inference is understanding why the pause, why the stream, and what people do to make both faster.

From the outside it looks intimidating: custom chips, CUDA kernels, a new acronym every month. Underneath, the moves are ones any backend engineer knows. Cache what you computed. Batch the expensive work. Do work ahead of time. Move fewer bytes. The vocabulary is new; the ideas are decades old.

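The trained model itself is close to a pure function: give it a sequence of tokens and it returns a probability distribution over the next one, its guess at what comes next. Calling that function in a loop, appending each chosen token to the input before the next call, is autoregressive generation. Everything else is built around that loop.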
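The loop is small enough to sketch in a few lines. What follows is a toy illustration, not a real model: `toy_model`, `VOCAB`, and `generate` are hypothetical names invented here, and the "model" is a hard-coded rule that favors whichever vocabulary entry comes after the last token. The shape is the point: a function from tokens to next-token probabilities, called repeatedly, with each pick appended to the input.

```python
# Toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]


def toy_model(tokens):
    """Stand-in for a trained model: tokens in, next-token probabilities out.

    Hypothetical rule: put 0.9 on the vocabulary entry after the last token
    and spread the remaining 0.1 evenly over the others.
    """
    last = VOCAB.index(tokens[-1]) if tokens else -1
    probs = [0.1 / (len(VOCAB) - 1)] * len(VOCAB)
    probs[(last + 1) % len(VOCAB)] = 0.9
    return probs


def generate(prompt, max_new_tokens=10):
    """Autoregressive generation with greedy decoding (always take the top guess)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)  # one call to the "pure function"
        best = max(range(len(probs)), key=probs.__getitem__)
        next_token = VOCAB[best]
        if next_token == "<eos>":  # the model signals it is done
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


print(generate(["the"]))  # ['the', 'cat', 'sat', 'on', 'mat']
```

Greedy decoding is used here only to keep the output deterministic; real systems often sample from the distribution instead of always taking the most likely token.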

This is the layer most engineers actually touch, because you build on inference APIs long before you train anything. Concepts like prefill and decode, the KV cache, and quantization explain your latency and your bill.
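As a rough illustration of how those concepts map to latency, here is a back-of-envelope sketch. It assumes a simplified model in which prefill processes the whole prompt at one rate and decode emits output tokens at a slower, steady rate. The function name `estimate_latency_s` and every number in the example are made up for illustration, not measurements of any real system.

```python
def estimate_latency_s(prompt_tokens, output_tokens,
                       prefill_tokens_per_s, decode_tokens_per_s):
    """Rough latency split under a simplified two-phase model (an assumption).

    Returns (prefill seconds, decode seconds, total seconds).
    """
    prefill_s = prompt_tokens / prefill_tokens_per_s  # the pause before text appears
    decode_s = output_tokens / decode_tokens_per_s    # the stream, one token at a time
    return prefill_s, decode_s, prefill_s + decode_s


# Hypothetical rates: 5,000 tokens/s for prefill, 50 tokens/s for decode.
prefill_s, decode_s, total_s = estimate_latency_s(
    prompt_tokens=2000, output_tokens=500,
    prefill_tokens_per_s=5000, decode_tokens_per_s=50,
)
print(f"pause: {prefill_s:.1f}s, stream: {decode_s:.1f}s, total: {total_s:.1f}s")
```

With these invented numbers the stream dominates the total, which is why long outputs tend to feel slower than long prompts. Real rates vary widely with the model, the hardware, and how many requests are batched together.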