pass@k

Most evals ask whether a model got a task right on the first try. pass@k asks something closer to how models get used in practice: given k attempts, does at least one of them work? pass@1 is a single shot. pass@10 is ten shots, with one success enough to count the task solved.

The gap between those two numbers is the useful part. A model with pass@1 of 30% but pass@10 of 80% can solve most of the tasks but can't do it reliably. A model where the two sit close together is steady: what it solves, it solves nearly every time. Telling "can't do it" apart from "can't do it consistently" is what decides whether you reach for better sampling or a genuinely stronger model.
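To make the distinction concrete, here is a small sketch of two made-up models with the same pass@1 and very different pass@10. It assumes each attempt at a task succeeds independently with a fixed per-task probability, which is a simplification. The per-task rates and the helper name pass_at_k_from_rates are hypothetical, chosen only for illustration.

```python
def pass_at_k_from_rates(rates, k):
    """Benchmark-level pass@k, given each task's per-attempt success rate.

    Assumes attempts are independent, so a task with rate p is solved
    within k attempts with probability 1 - (1 - p)**k.
    """
    return sum(1.0 - (1.0 - p) ** k for p in rates) / len(rates)


# Hypothetical model A: succeeds 30% of the time on every one of 10 tasks.
inconsistent = [0.3] * 10

# Hypothetical model B: always solves 3 tasks, never solves the other 7.
steady = [1.0] * 3 + [0.0] * 7

for name, rates in [("inconsistent", inconsistent), ("steady", steady)]:
    p1 = pass_at_k_from_rates(rates, 1)
    p10 = pass_at_k_from_rates(rates, 10)
    print(f"{name}: pass@1={p1:.3f} pass@10={p10:.3f}")
```

Both models score 30% on pass@1. Under the independence assumption the first climbs to about 97% at pass@10, while the second stays at 30%: more sampling helps the first and does nothing for the second.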

Measuring it well takes care. Drawing exactly k samples and checking for a win is noisy, so you draw n samples per task with n larger than k, count the c that pass, and use the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), averaged over tasks, where C(a, b) is the binomial coefficient. More samples, a steadier number.
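The estimator can be sketched in a few lines of Python. This assumes the commonly used unbiased form: with n samples for a task, of which c pass, the estimate is 1 - C(n-c, k) / C(n, k), the chance that a random size-k subset of the n samples contains at least one pass. The function names and the sample counts below are illustrative, not taken from any particular evaluation harness.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one task from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: three tasks, 20 samples each, with 3, 0, and 20 passes.
results = [(20, 3), (20, 0), (20, 20)]
print(mean_pass_at_k(results, 1))
print(mean_pass_at_k(results, 10))
```

For k = 1 the estimate reduces to c / n, the plain success rate. Using exact integer binomials from math.comb avoids the overflow and rounding problems that a naive product of floats can run into for large n.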

pass@k also frames what reinforcement learning does. A model that already solves a problem one time in ten has the ability buried somewhere inside it. Much of RL's job is closing the gap from below, lifting pass@1 toward pass@10 and turning an occasional success into the default. It's a clean reminder that training often sharpens what a model can already do more than it adds anything new.