Judges

A judge is whatever converts an output into a score. You'll also hear "grader" or "verifier." Designing judges is much of the day-to-day work on model behavior, and they come in two kinds.

Deterministic judges are code. Some questions can be answered mechanically: count the bold spans, scan for a banned word, measure the length, run the test suite. Code costs almost nothing to run, returns the same verdict every time, and can't be talked out of that verdict.
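As a minimal sketch of such a judge, the Python below runs three of the mechanical checks named above. It assumes bold is written Markdown-style as **text**; the thresholds, the banned-word list, and every function name are hypothetical choices for illustration, not a prescribed implementation.

```python
import re

# Assumption: bold spans use Markdown's **double asterisk** syntax.
BOLD_SPAN = re.compile(r"\*\*(.+?)\*\*", re.DOTALL)


def count_bold_spans(text: str) -> int:
    """Count non-overlapping **bold** spans in the text."""
    return len(BOLD_SPAN.findall(text))


def contains_banned_word(text: str, banned: frozenset[str]) -> bool:
    """True if any whole word in the text (case-insensitive) is banned."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return not words.isdisjoint(banned)


def deterministic_judge(
    text: str,
    max_bold_spans: int = 3,                       # hypothetical threshold
    max_chars: int = 2000,                         # hypothetical threshold
    banned: frozenset[str] = frozenset({"delve"}), # hypothetical word list
) -> dict:
    """Run every mechanical check and report each result plus an overall pass."""
    checks = {
        "bold_ok": count_bold_spans(text) <= max_bold_spans,
        "length_ok": len(text) <= max_chars,
        "no_banned_words": not contains_banned_word(text, banned),
    }
    return {"passed": all(checks.values()), **checks}
```

Calling deterministic_judge("Let us **delve** in.") reports bold_ok and length_ok as True but no_banned_words and passed as False, and it reports the same thing on every call. Matching is on whole words, so "delved" would not trip a ban on "delve"; whether that is the right behavior depends on the trait being graded.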

LLM judges handle what code can't. Is this reply terse or incomplete? Was the bolding tasteful? Did the explanation condescend? Those depend on meaning, so you prompt a capable model with a rubric and the transcript and ask it to grade.

LLM judges wobble, and they carry known biases: they favor the answer they read first, they favor longer answers, and they favor text that sounds like their own. A few habits make them far more reliable:

Compare pairs instead of scoring on a one-to-ten scale. Judges answer "which of these two is better?" more consistently than they assign an absolute score.
One rubric per trait. A judge grading "overall quality" produces mush; a narrow judge approaches the reliability of code.
Calibrate against humans. Regularly compare the judge's verdicts with human labels on the same samples, so drift is caught before it pollutes training.
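The first and third habits can be sketched in a few lines of Python. The judge here is a hypothetical callable that takes a rubric and two answers and returns "first" or "second"; in practice it would wrap a model call, which is deliberately left out. Asking in both orders is one common way to blunt the first-position bias mentioned above, and the agreement rate is only the simplest possible calibration measure.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (rubric, first_answer, second_answer)
# returns "first" or "second". A real one would prompt an LLM.
Judge = Callable[[str, str, str], str]


def compare_pair(ask_judge: Judge, rubric: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after asking the judge in both orders."""
    forward = ask_judge(rubric, a, b)   # a is shown first
    backward = ask_judge(rubric, b, a)  # b is shown first
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    # The verdict flipped when the order flipped: treat it as position bias.
    return "tie"


def agreement_rate(judge_verdicts: Sequence[str], human_verdicts: Sequence[str]) -> float:
    """Fraction of items on which the judge and the human label agree."""
    if not judge_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)
```

A judge that always picks whatever it reads first comes out of compare_pair as "tie" on every pair, which is the point: the swap turns a hidden bias into a visible non-answer. Tracking agreement_rate on a fixed human-labeled sample over time gives a rough drift signal; what counts as too low is a judgment call this sketch does not make.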

Most traits end up using both. Code counts the bold spans; a judge decides whether the bolding actually helped.
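One way to combine the two, sketched under assumptions: let code answer what it can, and call the LLM judge only when the mechanical check leaves the question open. The bolding_helped callable stands in for a narrow LLM judge and is hypothetical, as are the max_spans threshold and the Markdown-style **bold** syntax.

```python
import re
from typing import Callable

# Assumption: bold spans use Markdown's **double asterisk** syntax.
BOLD_SPAN = re.compile(r"\*\*(.+?)\*\*", re.DOTALL)


def grade_bolding(
    text: str,
    bolding_helped: Callable[[str], bool],  # hypothetical LLM judge wrapper
    max_spans: int = 5,                     # hypothetical threshold
) -> dict:
    """Code counts the bold spans; the judge is consulted only when needed."""
    spans = len(BOLD_SPAN.findall(text))
    if spans == 0:
        # Nothing to judge, so skip the model call entirely.
        return {"bold_spans": 0, "verdict": "not_applicable", "judge_called": False}
    if spans > max_spans:
        # Mechanical failure: no need to ask whether the bolding was tasteful.
        return {"bold_spans": spans, "verdict": "fail", "judge_called": False}
    helped = bolding_helped(text)
    return {
        "bold_spans": spans,
        "verdict": "pass" if helped else "fail",
        "judge_called": True,
    }
```

The design choice is ordering: the cheap, stable check runs first and acts as a gate, so the wobblier judge only sees cases where meaning actually matters.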