Synthetic Data

Real data runs out. The web has a fixed amount of good text, human labels are slow and expensive, and the exact example you need usually doesn't exist yet. Synthetic data is the way around that. You have a model generate the training data instead of collecting it.

It shows up at every stage. In pretraining, labs rewrite messy web pages into cleaner, textbook-style passages. In fine-tuning, a strong model writes worked examples for a weaker one to learn from, which is close to distillation. In reinforcement learning, the synthetic part is the task itself.

That last use is the interesting one. You can build the model a game to play: a synthetic problem with a clean pass or fail. Generate a broken codebase and check whether the model's fix turns the tests green. Pose a math question you already know the answer to. The model attempts it, a judge grades the result, and that grade becomes the reward. Once a game runs automatically, you can play it millions of times with no human in the loop.
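Here is a minimal sketch of such a game in Python, assuming the simplest possible setup: two-digit multiplication problems whose answers are known at generation time. The names `Problem`, `make_problem`, `grade`, `play`, and `perfect_policy` are all hypothetical, and the "policy" is a plain function standing in for a model. A real pipeline would call a model and run a much richer checker.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    prompt: str
    answer: int  # known when the problem is generated, so grading needs no human


def make_problem(rng: random.Random) -> Problem:
    """Generate one synthetic task with a known answer (hypothetical format)."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return Problem(prompt=f"What is {a} * {b}?", answer=a * b)


def grade(problem: Problem, attempt: str) -> float:
    """The judge: reward 1.0 for an exact match, 0.0 for anything else."""
    try:
        return 1.0 if int(attempt.strip()) == problem.answer else 0.0
    except ValueError:
        return 0.0  # unparseable output counts as a fail


def play(policy, n_games: int, seed: int = 0) -> float:
    """Run the game n_games times and return the mean reward."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_games):
        problem = make_problem(rng)
        total += grade(problem, policy(problem.prompt))
    return total / n_games


def perfect_policy(prompt: str) -> str:
    """Stand-in for a model: parses the prompt and multiplies."""
    parts = prompt.rstrip("?").split()  # ["What", "is", a, "*", b]
    return str(int(parts[2]) * int(parts[4]))
```

Calling `play(perfect_policy, 1000)` runs a thousand games with nobody grading by hand. In a training loop, the per-game reward from `grade` would feed the reinforcement learning update rather than being averaged.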

The catch is that a model can only generate from what it already knows, so naive self-training drifts. Train a model on its own typical output and each generation narrows toward the mean, losing the rare cases in the tails. That failure is called model collapse. Synthetic data works when something outside the model keeps it honest: a verifier, a test suite, a real-world signal, or a stronger model doing the teaching.
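The narrowing can be seen in a toy simulation. This is a sketch under heavy assumptions, not a model of real training: the "model" is just a Gaussian refit to its own samples each generation, and the outside signal is stood in for by mixing fresh samples from the true distribution into each batch. The function `self_train` and its parameters are hypothetical.

```python
import random
import statistics


def self_train(generations, n_samples, real_fraction=0.0, seed=0):
    """Refit a Gaussian to its own samples; return the fitted std per generation.

    real_fraction is the share of each batch drawn from the true N(0, 1)
    distribution, a toy stand-in for an external signal.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the true distribution
    history = [sigma]
    n_real = int(n_samples * real_fraction)
    for _ in range(generations):
        # Synthetic data: sampled from the current model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]
        # Outside data: sampled from the true distribution.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        # "Training": refit the model to this generation's batch.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


collapsed = self_train(generations=300, n_samples=10)
anchored = self_train(generations=300, n_samples=10, real_fraction=0.5)
print(f"pure self-training, final std: {collapsed[-1]:.3g}")
print(f"half real data, final std:     {anchored[-1]:.3g}")
```

With no outside data, the fitted spread should shrink toward zero over the generations, because each refit loses a little variance and nothing ever puts it back. With half of each batch drawn from the true distribution, the spread should stay in a healthy range.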

Done well, this is the flywheel behind much of the recent progress. Find a slice of work you can pose as a game, generate the problems, grade the attempts, and let the model practice its way forward. The hard part is not making the data. It is making the game reward what you actually want. Get that wrong and the model learns to win the game instead of doing the work, which is the trap known as reward hacking.