Eval Separation
An eval is only useful if it can tell models apart. Run it on a weak model and a strong one, and the scores should pull cleanly apart: better models up, worse models down, with daylight between them. That spread is separation. Without it, the eval can't help you decide anything.
There are two ways to lose it. An eval can saturate, where every current model scores around 98% and the test stops distinguishing the frontier, which is why old benchmarks get retired. Or it can be noisy, where the gap between two models is smaller than the wobble between two runs of the same model. In both cases you're reading randomness, not signal.
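Both failure modes can be checked with a few lines of arithmetic. The sketch below is one rough way to do it, not a standard method: the function name `separation_report`, the 3-point saturation margin, and the choice of standard deviation as the measure of "wobble" are all assumptions made for illustration. It takes several repeated scores per model, expressed as fractions between 0 and 1.

```python
from statistics import mean, stdev


def separation_report(runs_a, runs_b, ceiling=1.0, saturation_margin=0.03):
    """Compare repeated eval scores for two models (hypothetical helper).

    runs_a, runs_b: lists of at least two scores each, as fractions in [0, 1].
    saturation_margin: assumed cutoff; scores this close to the ceiling
    are treated as saturated.
    """
    mean_a, mean_b = mean(runs_a), mean(runs_b)

    # Gap between the two models' average scores.
    gap = abs(mean_a - mean_b)

    # Run-to-run wobble: the larger of the two per-model standard deviations.
    noise = max(stdev(runs_a), stdev(runs_b))

    return {
        "gap": gap,
        "noise": noise,
        # Saturated: even the lower-scoring model sits near the ceiling.
        "saturated": min(mean_a, mean_b) >= ceiling - saturation_margin,
        # Noisy: the gap is no bigger than the same-model wobble.
        "noisy": gap <= noise,
    }


if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(separation_report([0.81, 0.83, 0.82], [0.62, 0.60, 0.64]))
    print(separation_report([0.70, 0.76, 0.73], [0.72, 0.75, 0.69]))
```

With only a handful of runs the standard deviation is itself a shaky estimate, so a gap that barely clears the noise deserves more runs before anyone acts on it.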
Separation also has to point the right way, and that's where priors come in. You already carry a rough ranking in your head: this model is clearly sharper than that one. A trustworthy eval should mostly agree with it. When an eval insists a model you know is weaker beats a stronger one, the eval is usually what's broken, not your judgment, and the fix is to hunt for the leak: a contaminated test set, a grader fooled by formatting, a prompt that happens to flatter one model.
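One way to make "mostly agree" concrete is to count how many pairs of models the eval orders the same way your prior does. The sketch below assumes you can write your prior as a list from strongest to weakest; the function name `prior_agreement` and the model names are hypothetical, and counting ties as disagreements is a choice made here, not a rule.

```python
from itertools import combinations


def prior_agreement(prior_order, scores):
    """Fraction of model pairs the eval ranks the same way as the prior.

    prior_order: model names, strongest first (your own rough ranking).
    scores: dict mapping model name to eval score.
    Returns (agreement_rate, flipped_pairs).
    """
    if len(prior_order) < 2:
        raise ValueError("need at least two models to compare")

    agree = 0
    flipped = []
    # combinations() keeps input order, so the first name in each pair
    # is always the one the prior says is stronger.
    for stronger, weaker in combinations(prior_order, 2):
        if scores[stronger] > scores[weaker]:
            agree += 1
        else:
            # A tie or an inversion both count as disagreement here.
            flipped.append((stronger, weaker))

    return agree / (agree + len(flipped)), flipped


if __name__ == "__main__":
    # Hypothetical models and made-up scores.
    prior = ["large", "medium", "small"]
    eval_scores = {"large": 0.82, "medium": 0.85, "small": 0.61}
    rate, flipped = prior_agreement(prior, eval_scores)
    print(rate, flipped)
```

The flipped pairs are the useful output: each one is a place to start hunting for a leak, or, if nothing turns up, a place where the prior may need updating.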
Priors are a sanity check, not a cage. The reason to run an eval is to be surprised sometimes, to catch a real regression or a real jump you didn't see coming. The skill is knowing when a strange result means "recheck your assumptions" and when it means "the eval is lying to you." Calibrate against orderings you trust first, then trust the eval on the cases you can't call yourself.