Judges

A judge is whatever converts an output into a score. You'll also hear "grader" or "verifier." Designing judges is much of the day-to-day work on model behavior, and they come in two kinds.

Deterministic judges are code. Some checks can be answered mechanically: count the bold spans, scan for a banned word, measure the length, run the test suite. Code costs almost nothing to run, returns the same verdict every time, and can't be talked out of that verdict.
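As a rough illustration, three of those mechanical checks might look like the sketch below. It assumes outputs use Markdown-style **bold**; the banned-word list, the word limit, and every function name are hypothetical, chosen only for the example.

```python
import re

# Hypothetical settings, for illustration only.
BANNED_WORDS = {"delve", "tapestry"}
MAX_WORDS = 150

# Assumes bold is written Markdown-style, as **text**.
BOLD_SPAN = re.compile(r"\*\*(.+?)\*\*")


def count_bold_spans(text: str) -> int:
    """Count spans of the form **like this**."""
    return len(BOLD_SPAN.findall(text))


def find_banned_words(text: str) -> list:
    """Return the banned words that occur in the text, sorted alphabetically."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(words & BANNED_WORDS)


def deterministic_judge(text: str) -> dict:
    """Run each mechanical check and report the measurements plus a pass flag."""
    banned = find_banned_words(text)
    word_count = len(text.split())
    return {
        "bold_spans": count_bold_spans(text),
        "banned_words": banned,
        "word_count": word_count,
        # Pass only if no banned word appears and the length limit holds.
        "passed": not banned and word_count <= MAX_WORDS,
    }


if __name__ == "__main__":
    print(deterministic_judge("This is **bold** and **also bold**, let us delve in."))
```

Calling it twice on the same text gives the same result, which is the property that makes code a dependable judge.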

LLM judges handle what code can't. Is this reply terse or incomplete? Was the bolding tasteful? Did the explanation condescend? Those depend on meaning, so you prompt a capable model with a rubric and the transcript and ask it to grade.
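A minimal sketch of that pattern follows. It assumes the grading model is reachable through some function that takes a prompt string and returns the reply text; that function is passed in as call_model and is a stand-in, not a real library API. The rubric wording, the 1-to-5 scale, and the "SCORE: n" reply format are likewise assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; a real one would be tuned to the trait being graded.
RUBRIC = (
    "Grade the assistant's reply for condescension.\n"
    "5 = respectful throughout, 1 = clearly talks down to the user.\n"
    "Explain briefly, then end with a line of the form 'SCORE: <1-5>'."
)


def build_judge_prompt(rubric: str, transcript: str) -> str:
    """Combine the rubric and the transcript into a single grading prompt."""
    return f"{rubric}\n\n--- TRANSCRIPT ---\n{transcript}\n--- END TRANSCRIPT ---"


def parse_score(reply: str, lo: int = 1, hi: int = 5) -> Optional[int]:
    """Pull the score out of the judge's reply; return None if absent or out of range."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None


def llm_judge(
    transcript: str,
    call_model: Callable[[str], str],  # stand-in for whatever model client is in use
    rubric: str = RUBRIC,
) -> Optional[int]:
    """Ask the model to grade the transcript against the rubric."""
    reply = call_model(build_judge_prompt(rubric, transcript))
    return parse_score(reply)


if __name__ == "__main__":
    # A canned reply stands in for a real model call.
    fake_model = lambda prompt: "The reply lectures the user.\nSCORE: 2"
    print(llm_judge("User: How do I exit vim?\nAssistant: Obviously, :q.", fake_model))
```

Returning None for an unparseable reply, rather than guessing a score, keeps malformed judge output visible instead of silently skewing results.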

LLM judges wobble, and they carry known biases: they favor the answer they read first, they favor longer answers, and they favor text that sounds like their own. A few habits make them far more reliable:

Most traits end up using both. Code counts the bold spans; a judge decides whether the bolding actually helped.
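One way the two kinds might be combined for the bolding example is sketched here. The ask_llm argument is a hypothetical stand-in for an LLM judge that returns True when the bolding helped, and skipping the model call when there is nothing bold to assess is a design choice made for this example, not a rule from the text.

```python
import re
from typing import Callable, Optional

# Assumes bold is written Markdown-style, as **text**.
BOLD_SPAN = re.compile(r"\*\*(.+?)\*\*")


def judge_bolding(text: str, ask_llm: Callable[[str], bool]) -> dict:
    """Count bold spans with code, then ask an LLM judge whether they helped."""
    bold_spans = len(BOLD_SPAN.findall(text))
    helped: Optional[bool] = None
    if bold_spans > 0:
        # Only spend a model call when there is bolding to evaluate.
        helped = ask_llm(text)
    return {"bold_spans": bold_spans, "bolding_helped": helped}


if __name__ == "__main__":
    # A stub judge that always approves, standing in for a real model call.
    print(judge_bolding("Run **make test** before you push.", lambda text: True))
```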