Alignment Tax

A base model that only predicts text is, in a narrow sense, at its sharpest. The moment you train it to be helpful, safe, and on-brand, some of that raw capability slips. That loss is the alignment tax. You'll also hear it called the safety tax, and in RLHF it showed up early: the first instruction-tuned models regressed on a handful of academic benchmarks even as they got far more useful.

The cause is a kind of catastrophic forgetting. Pushing weights toward one objective drags them away from skills they already held. Every goal you add competes for the same parameters, so there's rarely a free upgrade. Gaining one behavior usually charges a little against another.

Reinforcement learning makes the trade explicit with a KL penalty. Alongside the reward, training adds a cost for drifting too far from the starting model, called the reference. Turn that cost up and capabilities hold still, but the model barely changes. Turn it down and the model moves fast, but risks forgetting old skills or sliding into reward hacking. The KL coefficient is the dial between the two.
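As a rough illustration of that dial, here is a minimal sketch of a KL-penalized reward for a single sampled response. The function name `kl_penalized_reward`, the coefficient name `beta`, and the log-probability values are all assumptions made up for this example, and the KL term is the simple sample-based estimate (the sum of per-token log-probability differences between the trained model and the reference), not any particular library's implementation.

```python
def kl_penalized_reward(reward, logp_policy, logp_reference, beta):
    """Return the reward minus beta times a sample-based KL estimate.

    logp_policy and logp_reference hold the log-probability that each model
    assigned to every token of the same sampled response.
    """
    if len(logp_policy) != len(logp_reference):
        raise ValueError("both models must score the same tokens")
    # Sum of per-token log-ratios: a single-sample estimate of the KL divergence.
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_reference))
    return reward - beta * kl_estimate


# Hypothetical per-token log-probabilities for one response.
logp_policy = [-0.5, -1.0, -0.25]
logp_reference = [-1.0, -1.5, -0.75]

# The same raw reward is worth less as the KL coefficient grows.
for beta in (0.0, 0.1, 1.0):
    print(beta, kl_penalized_reward(2.0, logp_policy, logp_reference, beta))
```

With `beta` at zero the optimizer sees only the raw reward, so nothing holds the model near the reference. As `beta` grows, a response the trained model favors much more strongly than the reference does is worth less and less.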

You can't escape the tax, but you can lower the bill. Keeping the penalty tuned, mixing some of the original training data back in, or averaging the trained model's weights with those of its starting point can each recover much of the lost capability while keeping most of the behavior you trained for. The goal is to pay as little as you can for exactly what you need.
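Averaging the trained model with its starting point can be sketched as a linear interpolation of weights. The example below is a toy under stated assumptions: models are plain dictionaries of parameter lists, and `interpolate_weights`, `alpha`, and the parameter names are hypothetical, chosen only for illustration.

```python
def interpolate_weights(base, tuned, alpha):
    """Return (1 - alpha) * base + alpha * tuned for every named parameter.

    alpha = 0.0 gives back the starting model, alpha = 1.0 the trained one.
    """
    if base.keys() != tuned.keys():
        raise ValueError("models must have the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(tuned[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1 - alpha) * b + alpha * t
            for b, t in zip(base[name], tuned[name])
        ]
    return merged


# Hypothetical two-parameter "models".
base = {"layer.weight": [0.0, 1.0], "layer.bias": [0.5]}
tuned = {"layer.weight": [1.0, 3.0], "layer.bias": [-0.5]}

merged = interpolate_weights(base, tuned, alpha=0.5)
print(merged)
```

In practice `alpha` would be chosen by evaluating the merged model on both the capability benchmarks and the trained behavior, then picking the point where the trade looks acceptable.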