Behavior Rewards

Most rewards ask one question: was the answer right? But a useful assistant is more than a right answer. It catches nuance, pushes back when a request is underspecified, and keeps working through a long task without stopping to ask whether it should continue. None of that shows up in a pass/fail check. Behavior rewards are how you train it.

The trouble is that behavior has no answer key. You can test whether code compiles. You can't test whether the model gave up too early, used the right tone, or asked a clarifying question at the right moment. So you write the behavior you want into a model spec and hand the transcript to a judge that scores how closely the attempt matched it.
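To make the spec-plus-judge loop concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Criterion` type, the `behavior_reward` function, the weights, and especially `toy_judge`, which is a keyword stand-in for what would in practice be a judge model reading the transcript against each criterion.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    """One line of a (hypothetical) model spec."""
    name: str
    description: str
    weight: float = 1.0


# A judge maps (criterion, transcript) to a score in [0, 1].
Judge = Callable[[Criterion, str], float]


def behavior_reward(spec: Sequence[Criterion], transcript: str, judge: Judge) -> float:
    """Weighted average of the judge's per-criterion scores."""
    total_weight = sum(c.weight for c in spec)
    if total_weight <= 0:
        raise ValueError("spec needs at least one criterion with positive weight")
    weighted = 0.0
    for criterion in spec:
        # Clamp so a misbehaving judge cannot push the reward out of range.
        score = min(1.0, max(0.0, judge(criterion, transcript)))
        weighted += criterion.weight * score
    return weighted / total_weight


def toy_judge(criterion: Criterion, transcript: str) -> float:
    """Stand-in only: a real judge would be a model, not a keyword check."""
    if criterion.name == "asks_clarifying_question":
        return 1.0 if "?" in transcript else 0.0
    return 0.5  # neutral score for criteria this toy cannot assess


SPEC = [
    Criterion("asks_clarifying_question",
              "Asks before acting on an underspecified request.", weight=2.0),
    Criterion("tone", "Direct and courteous, without filler."),
]

asked = behavior_reward(SPEC, "Which file do you mean?", toy_judge)
guessed = behavior_reward(SPEC, "Done. I deleted all of them.", toy_judge)
```

With these made-up weights, the transcript that asks a question scores 5/6 and the one that guesses scores 1/6. The point is the shape rather than the numbers: the spec is data, the judge is swappable, and the reward is only as good as the judge's reading of each criterion.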

This is where personality gets trained. Whether a model is terse or chatty, cautious or decisive, eager to please or willing to disagree: much of that is a reward someone chose during post-training, not an accident of pretraining.

Behavior is also the easiest thing to overshoot. Reward politeness and you get groveling filler. Reward thoroughness and answers swell with hedging nobody reads. The most common case is elongation, where "be complete" quietly becomes "be longer." Every trait you reward invites its own reward hacking, so the craft is rewarding the behavior without rewarding a cheap imitation of it.
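One way to picture guarding against elongation is to charge for words past a budget, so that padding cannot raise the reward by itself. This is a sketch of one possible mitigation, not a method the text prescribes; the function name, the 200-word budget, and the per-word penalty are all assumed values chosen for illustration.

```python
def length_adjusted_reward(
    judge_score: float,
    num_words: int,
    budget: int = 200,                # assumed word budget, not a recommended value
    penalty_per_word: float = 0.001,  # assumed cost of each word over budget
) -> float:
    """Subtract a linear penalty for words beyond the budget, floored at zero."""
    excess = max(0, num_words - budget)
    return max(0.0, judge_score - penalty_per_word * excess)


# Same judge score, different lengths: the padded answer earns less.
concise = length_adjusted_reward(0.8, num_words=150)
padded = length_adjusted_reward(0.8, num_words=500)
```

A penalty like this is itself a reward someone chose, so it can be overshot too: set it too steep and the model learns to truncate answers that needed the space.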