Reward Models

A reward model is a model whose only job is to score another model's outputs. Hand it a prompt and a response, and it returns a single number for how good that response is.

It exists to solve a bottleneck in RLHF. Reinforcement learning needs a reward for every attempt, and there can be millions of attempts. People can't keep up. So you collect a batch of human rankings, train a reward model to imitate them, and let it stand in for human judgment at scale.
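To make "train a reward model to imitate them" concrete, here is a minimal sketch. It assumes the rankings arrive as (chosen, rejected) pairs and uses a Bradley-Terry style pairwise loss, which is one common choice rather than the only one. The linear scorer, the two-number feature vectors, and the names pairwise_loss, score, train, and PAIRS are all made up for illustration; a real reward model would be a neural network reading the prompt and response text.

```python
import math

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected))."""
    margin = score_chosen - score_rejected
    # Equals log(1 + exp(-margin)), written to avoid overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def score(weights: list[float], features: list[float]) -> float:
    """Toy reward model: a weighted sum of hand-made features (an assumption)."""
    return sum(w * f for w, f in zip(weights, features))

def train(pairs, steps: int = 200, lr: float = 0.5) -> list[float]:
    """Fit the weights so each chosen response outscores its rejected one."""
    weights = [0.0] * len(pairs[0][0])
    for _ in range(steps):
        for chosen, rejected in pairs:
            margin = score(weights, chosen) - score(weights, rejected)
            # Gradient of the loss w.r.t. the margin is -sigmoid(-margin).
            push = 1.0 / (1.0 + math.exp(margin))
            for i in range(len(weights)):
                weights[i] += lr * push * (chosen[i] - rejected[i])
    return weights

# Hypothetical features: [relative length, answers the question (0 or 1)].
# Each pair is (features of the response people preferred, features of the other).
PAIRS = [
    ([0.2, 1.0], [0.9, 0.0]),
    ([0.5, 1.0], [0.5, 0.0]),
    ([0.3, 1.0], [0.8, 0.0]),
]
WEIGHTS = train(PAIRS)
```

After training, score(WEIGHTS, features) is the cheap stand-in: it can rate a new response without asking a person.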

In other words, a reward model is human taste distilled into something fast and automatic. Many judgments get compressed into one cheap forward pass.

It isn't perfect, and that's the risk. The reward model is a proxy for what people want, and a strong optimizer will find the gaps between the proxy and the real goal. That's reward hacking, and guarding against it is constant work.
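A toy illustration of that gap, under loudly stated assumptions: the proxy below is deliberately flawed (it mostly tracks correctness but also pays a little for every extra word), and the "optimizer" is just picking the top-scoring candidate from a fixed list. The functions proxy_reward and true_quality and the candidate strings are invented for this sketch.

```python
def words_of(response: str) -> list[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return [w.strip(".,!?").lower() for w in response.split()]

def proxy_reward(response: str) -> float:
    """Hypothetical flawed proxy: rewards the right answer, but also length."""
    words = words_of(response)
    correct_bonus = 2.0 if "4" in words else 0.0
    return correct_bonus + 0.1 * len(words)

def true_quality(response: str) -> float:
    """What people actually wanted: the correct answer to 'What is 2 + 2?'."""
    return 1.0 if "4" in words_of(response) else 0.0

CANDIDATES = [
    "4",
    "The answer is 4",
    # Long, confident-sounding padding that never answers the question.
    "Great question! " + "Let me think carefully about this. " * 5 + "It depends.",
]

# The "optimizer": choose whatever the proxy scores highest.
PICKED = max(CANDIDATES, key=proxy_reward)
```

Here PICKED is the padded non-answer. The proxy rates it above "The answer is 4" because length alone outweighs the correctness bonus, even though its true quality is zero. Real reward hacking is subtler, but the shape is the same: the optimizer maximizes the score, not the intent behind it.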

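As models have gotten more capable, a lighter option has appeared. Instead of training a dedicated reward model, you can hand a capable model a rubric and let it grade responses directly. A model used this way is called a judge, and it does much the same job as a reward model: both turn a prompt and a response into a score. The difference is that a judge is prompted with a rubric rather than trained on a batch of rankings.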
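A rough sketch of grading with a rubric, assuming the grading model is reachable through some function that takes a prompt string and returns text. That function is passed in as call_model and is hypothetical, as are the rubric wording, the "Score: N" reply format, and fake_model, which stands in for a real model so the example runs on its own.

```python
import re
from typing import Callable, Optional

# Illustrative rubric; a real one would be tailored to the task.
RUBRIC = """Score the response from 1 to 5.
5: correct, complete, and clear.
3: partly correct or missing key details.
1: incorrect or off-topic.
Reply with a line of the form 'Score: <number>'."""

def build_judge_prompt(prompt: str, response: str, rubric: str = RUBRIC) -> str:
    """Combine the rubric with the prompt and response being graded."""
    return f"{rubric}\n\nPrompt:\n{prompt}\n\nResponse:\n{response}\n"

def parse_score(judge_output: str) -> Optional[int]:
    """Pull the 1-5 score out of the judge's reply, or None if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    return int(match.group(1)) if match else None

def judge(prompt: str, response: str, call_model: Callable[[str], str]) -> Optional[int]:
    """Grade one response by asking the model to apply the rubric."""
    return parse_score(call_model(build_judge_prompt(prompt, response)))

def fake_model(text: str) -> str:
    """Stand-in for a real model call; always returns the same verdict."""
    return "The response is correct and clear.\nScore: 5"

GRADE = judge("What is 2 + 2?", "4", fake_model)
```

Returning None when no score can be parsed keeps a malformed reply from silently turning into a reward; the caller can retry or drop the sample.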