Reinforcement Learning

Fine-tuning learns from fixed examples. Reinforcement learning, or RL, learns from outcomes. The model tries something, gets a reward signal, and shifts toward the choices that scored well.

Think of learning basketball. You can read every book, but you get good by taking your own shots and seeing what goes in. RL is the reps. The model generates its own attempts and learns from how they turn out.
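
To make the try, score, shift loop concrete, here is a minimal sketch in Python. It is a toy, not how language models are actually trained: the "policy" is just a list of preference numbers, each action pays a fixed reward, and the update is a simple policy-gradient rule with a running-average baseline. The names train_policy, rewards, and lr are made up for this illustration.

```python
import math
import random


def softmax(prefs):
    """Turn raw preference scores into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, steps=5000, lr=0.1, seed=0):
    """Toy RL loop: try an action, see its reward, shift toward what scored well.

    Assumption for illustration: rewards[a] is the fixed reward for action a.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)   # the "weights" of this tiny policy
    baseline = 0.0                 # running average of rewards seen so far
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        # Try something: sample an action from the current policy.
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        # Get a reward signal.
        reward = rewards[action]
        baseline += (reward - baseline) / t
        advantage = reward - baseline
        # Shift: raise the chosen action's preference if it beat the average,
        # lower it if it scored worse than average.
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            prefs[a] += lr * advantage * (indicator - probs[a])
    return softmax(prefs)


probs = train_policy([0.2, 0.9])
print(probs)  # the higher-reward action ends up with the larger probability
```

Nobody tells the policy which action is correct. It only ever sees the reward for what it tried, and that is enough to tilt it toward the better choice.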

Two words make every RL discussion easier:

The policy is the model being trained, viewed as a rule for choosing what to do next. When researchers say "the policy," they mean the model as its current weights define it.
A rollout is one complete attempt, start to finish, like a full agent session from your prompt to its final answer.
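
In code, the two terms might look like the sketch below. Everything here is hypothetical and chosen for illustration: the "environment" is a number line where the policy steps from 0 toward a goal at 3, and Rollout, run_rollout, and always_forward are invented names, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A policy maps the current state to an action. Here both are plain integers.
Policy = Callable[[int], int]


@dataclass
class Rollout:
    """One complete attempt: every (state, action) pair plus the final reward."""
    steps: List[Tuple[int, int]] = field(default_factory=list)
    reward: float = 0.0


def run_rollout(policy: Policy, start: int = 0, goal: int = 3,
                max_steps: int = 10) -> Rollout:
    """Run the policy from start until it reaches the goal or runs out of steps."""
    state = start
    rollout = Rollout()
    for _ in range(max_steps):
        action = policy(state)
        rollout.steps.append((state, action))
        state += action
        if state == goal:
            rollout.reward = 1.0  # the reward arrives only at the end
            break
    return rollout


def always_forward(state: int) -> int:
    """A trivial policy: always step forward by one."""
    return 1


result = run_rollout(always_forward)
print(result.steps)   # [(0, 0 + 1 step), ...] as (state, action) pairs
print(result.reward)
```

Training repeats this many times: collect rollouts from the current policy, score them, update the weights, and collect again from the updated policy.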

RL shines where the right answer is hard to write down but easy to judge. For code you can check whether tests pass. For tone you can ask which of two replies is better. That judgment becomes the reward.
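
Both kinds of judgment can be sketched as small reward functions. This is an illustration under assumptions, not a production grader: reward_from_tests and reward_from_preference are hypothetical names, the test cases are hand-written pairs of arguments and expected results, and the judge is any callable that says which of two replies it prefers.

```python
def reward_from_tests(candidate, cases):
    """Reward for code: the fraction of test cases the candidate passes.

    cases is a list of (args, expected) pairs. A crash counts as a failure.
    """
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # an exception is simply a failed test
    return passed / len(cases)


def reward_from_preference(judge, reply_a, reply_b):
    """Reward for tone: 1.0 to the reply the judge prefers, 0.0 to the other.

    judge(reply_a, reply_b) is assumed to return "a" or "b".
    """
    if judge(reply_a, reply_b) == "a":
        return 1.0, 0.0
    return 0.0, 1.0


# Usage: score two attempts at writing an add function.
cases = [((1, 2), 3), ((0, 0), 0), ((2, 2), 4)]
print(reward_from_tests(lambda a, b: a + b, cases))  # correct attempt
print(reward_from_tests(lambda a, b: a * b, cases))  # buggy attempt

# A toy judge that prefers the shorter reply. A real judge would be a person
# or another model.
def prefers_shorter(a, b):
    return "a" if len(a) <= len(b) else "b"

print(reward_from_preference(prefers_shorter, "Sure, done.", "Certainly! I have done it."))
```

Notice that neither function needs to know the right answer in advance. It only needs to recognize a good one.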

RL is also delicate. Push too hard on one reward and the model can lose other skills, or learn to game the metric. And a reward that only arrives at the end raises a hard question: which action earned it? That's credit assignment, one of the central problems of RL.
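
One common, simple answer to credit assignment is discounting: pass the final reward backward through the rollout, shrinking it at each step so that later actions get more credit than earlier ones. The sketch below shows that one heuristic, not the only approach. The function name discounted_returns and the discount factor gamma are choices made for this example.

```python
def discounted_returns(rewards, gamma=0.9):
    """Give each step credit for the rewards that came after it.

    rewards[t] is the reward received at step t. The return at step t is
    rewards[t] + gamma * rewards[t+1] + gamma**2 * rewards[t+2] + ...
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A four-step rollout where the only reward arrives at the end.
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5))
# -> [0.125, 0.25, 0.5, 1.0]
```

Discounting is a guess, not a proof. It assumes the actions closest to the reward mattered most, which is often roughly true and sometimes badly wrong.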