On-Policy vs Off-Policy

This pair answers one question: who generated the data you're learning from?

Off-policy learning uses data someone else produced: books, demonstrations, examples written by experts. Pretraining is off-policy, and so is fine-tuning on curated examples. That data is cheap and plentiful, but the lessons live in someone else's situations, not yours.

On-policy learning uses your own attempts, generated by the current version of the model. Reinforcement learning on the model's own rollouts is the standard example: the model produces an attempt, gets corrected where it goes wrong, and updates. It's expensive, because yesterday's attempts came from yesterday's model, so you have to keep generating fresh ones. In exchange, there's no gap between practice and reality.
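To make the cost difference concrete, here is a minimal sketch in Python. Nothing in it is a real training loop: `generate`, `train_off_policy`, and `train_on_policy` are hypothetical stand-ins, and "updating the model" is reduced to bumping a version counter. It only counts how much fresh data each approach has to produce.

```python
def generate(version, n):
    """Hypothetical stand-in for sampling n attempts from model version `version`."""
    return [{"text": f"attempt {i}", "author": version} for i in range(n)]


def train_off_policy(dataset, steps):
    """Every step reuses the same fixed dataset written by someone else."""
    version, generated = 0, 0
    for _ in range(steps):
        batch = dataset   # nothing new is sampled
        version += 1      # stand-in for one model update on `batch`
    return version, generated


def train_on_policy(steps, batch_size):
    """Every step samples a fresh batch from the model as it is right now."""
    version, generated = 0, 0
    for _ in range(steps):
        batch = generate(version, batch_size)
        # On-policy condition: the batch was written by the current version.
        assert all(sample["author"] == version for sample in batch)
        generated += len(batch)
        version += 1      # after this update, `batch` is stale and is dropped
    return version, generated


expert_data = [{"text": "expert example", "author": "expert"}] * 4
off_result = train_off_policy(expert_data, steps=5)
on_result = train_on_policy(steps=5, batch_size=4)
print(off_result, on_result)
```

Both loops make the same number of updates. The off-policy loop generates nothing, while the on-policy loop has to generate a new batch at every step, which is where the expense comes from.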

Why the gap matters: imagine learning to drive only by watching a flawless driver. The footage never shows a mistake, so you never learn what to do after one. The first time you drift toward the shoulder, you're in a situation the videos never covered, and small errors compound.
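The driving story can be run as a toy simulation. The sketch below is an illustration under made-up assumptions, not a real driving model: the state is a single lane offset, the dynamics in `step` double any offset that goes uncorrected, and the learner is a lookup table that does nothing in states it has never seen. The repair step, asking the expert to label the states the learner actually visited, is in the spirit of dataset aggregation (DAgger), simplified to one round.

```python
def expert_action(offset):
    """A flawless driver: steers exactly enough to return to center."""
    return -2 * offset


def step(offset, action):
    """Assumed dynamics: an uncorrected offset doubles every step."""
    return 2 * offset + action


def rollout(act, steps=8, bump_at=None):
    """Drive for `steps` steps; optionally inject one small mistake."""
    offset, visited = 0, []
    for t in range(steps):
        if t == bump_at:
            offset += 1  # the single slip toward the shoulder
        visited.append(offset)
        offset = step(offset, act(offset))
    return visited, offset


# Off-policy data: the expert never slips, so only offset 0 is ever seen.
demo_states, _ = rollout(expert_action)
table = {s: expert_action(s) for s in demo_states}


def learner(offset):
    """Imitates the table; does nothing (action 0) in unseen states."""
    return table.get(offset, 0)


# The learner slips once and has no idea how to recover.
off_visited, off_final = rollout(learner, bump_at=2)

# On-policy fix: label the states the learner itself reached.
for s in off_visited:
    table[s] = expert_action(s)
on_visited, on_final = rollout(learner, bump_at=2)

print(off_final, on_final)
```

Trained only on the flawless footage, the learner ends far from center after one slip. After seeing corrections for its own states, the same slip is undone on the next step.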

You want both, in order. Off-policy first for broad knowledge, then on-policy to fix how the model behaves in its own weird situations, the ones no example covers.