Gradient Descent
Gradient descent is the step where a model actually learns. Backpropagation computes the gradient, which tells you, for every weight, which direction is downhill for the loss. Gradient descent then moves the weights in that direction.
Picture the loss as a hilly landscape, with the model's weights as your position. You want the lowest valley. The gradient points uphill, so you step the opposite way, a little at a time, and the loss drops.
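Written as a formula, one downhill step looks roughly like this. The symbols are notation assumed here for illustration, not taken from the text: w_t is the weights at step t, L is the loss, and η is a small positive step size.

```latex
w_{t+1} = w_t - \eta \, \nabla_{w} L(w_t)
```

The minus sign is the "step the opposite way" part: the gradient ∇L points uphill, so subtracting a small multiple of it moves the weights downhill.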
The whole loop is four steps repeated:
- Forward pass: make a prediction.
- Loss: measure how wrong it was.
- Backward pass: compute the gradient.
- Update: nudge every weight a step downhill.
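The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not how a real framework does it: the model is a straight line with two weights, the data is made up from y = 2x + 1, the loss is mean squared error, and the gradient is written out by hand instead of coming from backpropagation. The names `train`, `xs`, `ys`, and the chosen values are all assumptions for the example.

```python
# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def train(w, b, steps, learning_rate=0.05):
    """Fit y = w*x + b to (xs, ys) with plain gradient descent."""
    n = len(xs)
    loss = None
    for _ in range(steps):
        # 1. Forward pass: make a prediction for every input.
        preds = [w * x + b for x in xs]

        # 2. Loss: mean squared error between predictions and targets.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n

        # 3. Backward pass: gradient of the loss with respect to each weight
        #    (derived by hand here; backpropagation automates this).
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n

        # 4. Update: nudge every weight a small step downhill.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss


# Start from zero weights and repeat the loop many times.
w_fit, b_fit, final_loss = train(0.0, 0.0, steps=2000)
print(w_fit, b_fit, final_loss)  # w_fit approaches 2, b_fit approaches 1
```

Real training differs mainly in scale: millions or billions of weights, gradients computed automatically, and each step usually run on a small batch of data instead of the whole dataset.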
How big a step you take is the learning rate, often the single most important knob in training. Too large and you overshoot the valley, bouncing around or diverging. Too small and training crawls. In practice you don't hold it fixed: a schedule typically warms it up from a small value, then decays it, so you take cautious steps at the very start, bigger ones once training is stable, and smaller ones again as you close in. Optimizers like Adam additionally adapt the effective step size for each weight.
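To make the effect of the learning rate concrete, here is a small sketch on the simplest possible landscape, loss(w) = w², whose gradient is 2w and whose valley is at w = 0. The function names `descend` and `lr_schedule`, and every number used, are hypothetical choices for this example. The schedule shown (linear warmup, then linear decay) is just one common shape among several.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run gradient descent on loss(w) = w**2 (gradient 2*w), starting at w."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


# Each step multiplies w by (1 - 2 * learning_rate).
too_large = descend(1.1)     # factor -1.2: overshoots further every step, diverges
too_small = descend(0.001)   # factor 0.998: barely moves in 20 steps
reasonable = descend(0.1)    # factor 0.8: steady progress toward 0


def lr_schedule(step, peak=0.1, warmup_steps=10, total_steps=100):
    """Linear warmup to `peak`, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from a small value to the peak learning rate.
        return peak * (step + 1) / warmup_steps
    # Decay linearly from the peak down to zero at total_steps.
    remaining = max(total_steps - step, 0)
    return peak * remaining / (total_steps - warmup_steps)


print(too_large, too_small, reasonable)
print([round(lr_schedule(s), 3) for s in (0, 5, 10, 50, 99)])
```

In a training loop you would call something like `lr_schedule(step)` before each update and use its result as the step size for that update.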
Run this loop over enough data for enough passes and the loss keeps falling, so the model's predictions keep improving. That's training in one picture: define a loss, find the gradient, take a tiny step, repeat.