Zig-Zag Charts
Some curves are supposed to move one way. A loss that falls over training, a reward that rises, a benchmark that climbs as models improve. When one of these zig-zags instead, the wobble is information. The shape is telling you something the headline number hides.
You meet it in two places. The first is a training curve that saw-tooths: reward climbs, the model finds a way to hack it, someone patches the judge, the reward drops, then it climbs again. That jagged line is the literal record of an optimizer and a team fighting over a number, and it's often more honest than a smoothed average.
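One rough way to find the teeth of that saw is to flag every step where the reward falls sharply from the step before. The sketch below is only an illustration: the function name `find_sharp_drops`, the `min_drop` threshold, and the reward values are all made up, and a real curve would need a threshold chosen by looking at its usual step-to-step noise.

```python
def find_sharp_drops(rewards, min_drop):
    """Return the step indices where reward fell by at least min_drop.

    rewards: list of per-step reward values (hypothetical data).
    min_drop: smallest fall that counts as a "tooth" (an assumed threshold).
    """
    drops = []
    for step in range(1, len(rewards)):
        if rewards[step - 1] - rewards[step] >= min_drop:
            drops.append(step)
    return drops


# Invented curve: two climbs, each cut short by a sudden fall.
rewards = [0.2, 0.4, 0.7, 0.9, 0.5, 0.6, 0.8, 0.95, 0.55, 0.7]
print(find_sharp_drops(rewards, min_drop=0.3))  # [4, 8]
```

Each flagged step is a place to check your notes: if a drop lines up with a judge patch, the climb before it is a candidate for reward hacking rather than real progress.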
The second is a ranking plot. Line your models up by how capable you believe they are, then chart their scores. A trustworthy eval rises in roughly that order. One that zig-zags, where a model you know is weaker pokes above a stronger one, is failing to separate them, usually because of noise or a contaminated test set. The bumps mark exactly which comparisons not to trust.
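The same check can be done without a plot by listing every pair where the supposedly weaker model outscores a stronger one. This is a minimal sketch under stated assumptions: the input is already sorted from weakest to strongest according to your own prior belief, and the model names and scores are invented for illustration.

```python
def find_inversions(scores_in_capability_order):
    """Return (weaker, stronger) name pairs where the weaker model scored higher.

    scores_in_capability_order: list of (name, score) tuples, sorted from
    the model you believe is weakest to the one you believe is strongest.
    """
    inversions = []
    for i, (weak_name, weak_score) in enumerate(scores_in_capability_order):
        for strong_name, strong_score in scores_in_capability_order[i + 1:]:
            if weak_score > strong_score:
                inversions.append((weak_name, strong_name))
    return inversions


# Hypothetical models and scores, ordered by believed capability.
ranked = [("small", 41.0), ("medium", 55.0), ("large", 52.0), ("xl", 63.0)]
print(find_inversions(ranked))  # [('medium', 'large')]
```

An empty list means the eval agrees with your ordering. Each pair it returns is a comparison to treat with suspicion until you know why it flipped.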
The mistake is smoothing the zig-zag away and reporting only the trend line. Averaging across seeds is right when the wobble is random noise. But a stubborn zig-zag where you expected a clean climb is a clue, not clutter. Read it before you erase it, because it points straight at the part of your setup that's fooling you.
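A simple way to tell the two cases apart is to ask how often the flip shows up across seeds before you average them. The sketch below assumes you have per-seed scores for two models, paired by seed; the name `inversion_rate` and the numbers are hypothetical, and the rate is a rough signal, not a significance test.

```python
def inversion_rate(weaker_scores, stronger_scores):
    """Fraction of seeds where the supposedly weaker model scored higher.

    Both lists hold one score per seed, in the same seed order (an
    assumption of this sketch).
    """
    pairs = list(zip(weaker_scores, stronger_scores))
    flips = sum(weak > strong for weak, strong in pairs)
    return flips / len(pairs)


# Invented per-seed scores for two models across five seeds.
weaker = [55.0, 56.0, 54.0, 57.0, 55.0]
stronger = [52.0, 53.0, 55.0, 51.0, 52.0]
print(inversion_rate(weaker, stronger))  # 0.8
```

A rate near 0.5 looks like noise that averaging will wash out. A rate near 1.0 means the zig-zag survives reseeding, which is the stubborn kind worth investigating.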