Evaluation

How do you know whether a model is good? You evaluate it. Evaluations, usually shortened to evals, do for models what tests do for code: you wouldn't ship code without tests, and you shouldn't deploy a model without evals.

An eval tests a model against tasks with known good outcomes. Some are standardized benchmarks that compare models on shared problems. The ones that matter most are custom evals built for your own use case, because they measure what you actually need rather than a generic score.
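As a rough illustration, a custom eval can be as small as a list of cases with known good answers and a function that scores a model against them. The sketch below is a minimal example under stated assumptions, not a standard harness: `EVAL_CASES`, `stub_model`, and `run_eval` are hypothetical names, the model is a stand-in that returns canned answers, and grading is a simple case-insensitive exact match.

```python
from typing import Callable

# Hypothetical eval set: each case pairs a prompt with a known good answer.
EVAL_CASES = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is the opposite of hot?", "expected": "cold"},
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption); returns canned answers."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "paris",
        "What is the opposite of hot?": "warm",
    }
    return canned.get(prompt, "")


def run_eval(model: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of cases where the output matches the expected answer."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(
        # Case-insensitive exact match; real evals often need a looser grader.
        model(case["prompt"]).strip().lower() == case["expected"].strip().lower()
        for case in cases
    )
    return passed / len(cases)


score = run_eval(stub_model, EVAL_CASES)
print(f"pass rate: {score:.2f}")
```

Swapping `stub_model` for a function that calls a real model leaves the rest unchanged, which is the point: the cases and the grader stay fixed while the model varies.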

Good evals share two traits. They're representative of real usage, so a high score means real quality, and they're comprehensive enough to catch the edge cases that break in production. Building them for your domain is as important as any model choice, since without them you can't tell whether a change helped or quietly made things worse.
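One way to see whether a change helped or quietly made things worse is to compare per-case results from before and after it, not just the overall score. The following is a hedged sketch: `compare_runs` and the case names are invented for illustration, and each run is assumed to be a mapping from case ID to pass or fail.

```python
def compare_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """List the cases that flipped between two eval runs."""
    # Only compare cases present in both runs.
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
    }


# Hypothetical results: both runs pass 2 of 3 cases, so the headline score is unchanged.
baseline = {"refund-policy": True, "empty-input": False, "long-thread": True}
candidate = {"refund-policy": True, "empty-input": True, "long-thread": False}

diff = compare_runs(baseline, candidate)
print(diff)
```

In this made-up example the pass rate is identical across the two runs, yet one edge case was fixed and another broke. A single aggregate number would hide that trade.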

Evals show up everywhere in this glossary. Because models are probabilistic, evals are how you add back predictability. They turn a vague complaint like "it gives up too early" into a number that moves, which is what lets training and prompt engineering make real progress.

One discipline keeps evals honest: never train on the exact cases you evaluate with. A model trained on its own test cases can memorize the answers, so the score climbs while its behavior on new inputs stays the same.
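A simple guard is to check for overlap between training data and eval cases before trusting a score. The sketch below assumes both sets are available as lists of prompt strings; `normalize` and `find_overlap` are hypothetical helpers. It only catches duplicates that match after lowercasing and whitespace cleanup, so paraphrased leaks would need a fuzzier comparison.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def find_overlap(train_prompts: list[str], eval_prompts: list[str]) -> set[str]:
    """Return the eval prompts that also appear, after normalization, in the training data."""
    train_seen = {normalize(p) for p in train_prompts}
    return {p for p in eval_prompts if normalize(p) in train_seen}


# Hypothetical data for illustration.
train_prompts = ["What is 2 + 2?", "Name a primary color."]
eval_prompts = ["what is  2 + 2?", "What is the capital of France?"]

leaked = find_overlap(train_prompts, eval_prompts)
print(f"{len(leaked)} eval case(s) also appear in training data")
```

Any case that turns up in `leaked` should be removed from one side or the other before the eval score is taken at face value.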