Elongation

Watch a model across a long reinforcement learning run and its answers tend to grow. Same questions, steadily longer responses. This creep toward length is elongation, and it's one of the most reliable side effects of running RL on a language model.

It has two roots. In RLHF, the reward model picks up a quiet bias. In the human preference data, longer answers were rated higher a little more often, so the reward model learns to score length itself, and the policy trained against it learns that more words mean more reward. That's textbook reward hacking. In RL with verifiable rewards, the cause is subtler: a longer chain of thought gives the model a few more chances to reach the right answer, so length grows even when only correctness is scored.
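
One rough way to check for the first root is to count how often the preferred answer in a preference dataset is also the longer one. The sketch below assumes the data is available as (chosen, rejected) text pairs and uses word count as a stand-in for token count. The function name and data layout are hypothetical, not taken from any particular library.

```python
def longer_wins_rate(pairs):
    """Fraction of preference pairs where the chosen response is the longer one.

    `pairs` is an iterable of (chosen, rejected) strings (an assumed layout).
    Length is measured in whitespace-separated words, a rough proxy for tokens.
    Pairs with equal length are skipped. Returns 0.5 if nothing is comparable.
    """
    wins = 0
    total = 0
    for chosen, rejected in pairs:
        len_chosen = len(chosen.split())
        len_rejected = len(rejected.split())
        if len_chosen == len_rejected:
            continue  # no length signal in this pair
        total += 1
        if len_chosen > len_rejected:
            wins += 1
    return wins / total if total else 0.5


if __name__ == "__main__":
    toy_pairs = [
        ("a detailed answer with many words", "short answer"),
        ("brief", "a much longer but rejected answer"),
        ("another long and thorough reply", "terse reply"),
    ]
    print(f"longer answer preferred in {longer_wins_rate(toy_pairs):.0%} of pairs")
```

A rate near 50% suggests raters were not favoring length. A rate well above it is a hint, not proof, that a reward model trained on this data could pick up the bias, since longer answers may also be genuinely better.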

Either way, you pay for it twice. Responses fill with restated context and hedging that add nothing, and every extra token is more inference cost and latency for the same quality.

The fixes all come down to telling training that length isn't free. You can penalize tokens directly, scale the reward against how long an attempt ran, or train the reward model to judge content independently of length. The aim isn't short answers. It's a model that spends words only when they earn their place.
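
The first two fixes can be sketched as small reward-shaping functions. This is a minimal illustration under stated assumptions, not a recipe from any specific training setup: the function names, the free token budget, the per-token penalty, and the scaling floor are all hypothetical values chosen for the example.

```python
def length_penalized_reward(base_reward, num_tokens, budget=256, penalty_per_token=0.001):
    """Penalize tokens directly: subtract a linear cost for tokens past a free budget.

    `budget` and `penalty_per_token` are illustrative defaults, not tuned values.
    """
    excess = max(0, num_tokens - budget)
    return base_reward - penalty_per_token * excess


def length_scaled_reward(correct, num_tokens, max_tokens=1024, min_scale=0.5):
    """Scale a correctness reward by how much of the token limit the attempt used.

    A correct answer earns 1.0 at zero tokens, falling linearly to `min_scale`
    at `max_tokens` (and staying there beyond it). An incorrect answer earns 0.0
    regardless of length, so brevity alone is never rewarded.
    """
    if not correct:
        return 0.0
    used = min(num_tokens, max_tokens) / max_tokens
    return 1.0 - (1.0 - min_scale) * used


if __name__ == "__main__":
    # Same base reward, different lengths: the longer response scores lower.
    print(length_penalized_reward(1.0, 200))
    print(length_penalized_reward(1.0, 456))
    # Two correct attempts: the shorter one keeps more of the reward.
    print(length_scaled_reward(True, 128))
    print(length_scaled_reward(True, 1024))
```

The free budget in the first function and the floor in the second reflect the stated aim: neither pushes the model toward the shortest possible answer. They only make extra words cost something.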