RLHF

RLHF stands for Reinforcement Learning from Human Feedback. It's the technique that turned capable base models into assistants people actually like talking to.

The problem it solves: "be helpful" has no formula. You can't write a loss function for it directly. RLHF gets around that by learning human taste and optimizing against it.

[Figure: the four RLHF stages (Generate, Rank, Reward, Fine-tune). In the Generate stage, the model writes several candidate responses: Reply A (brief answer), Reply B (clear and complete), Reply C (unhelpful refusal).]

It runs in a few steps:

  1. The model generates several responses to a prompt.
  2. People rank those responses from best to worst.
  3. A separate reward model is trained to predict those rankings.
  4. The original model is improved with reinforcement learning, using the reward model as its scorer.
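Step 3 is the least obvious one, so here is a minimal sketch of how rankings can become a training signal. It assumes a common approach (not necessarily the one any particular lab uses): each ranking is expanded into pairwise comparisons, and the reward model is penalized with a Bradley-Terry style loss whenever it scores the rejected reply above the preferred one. The function names and the example ranking are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the one the human ranked higher.
    return list(combinations(ranked, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Compute -log(sigmoid(score_preferred - score_rejected)) stably."""
    diff = score_preferred - score_rejected
    # log(1 + exp(-diff)), rewritten so exp() never sees a large positive argument.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Hypothetical ranking of three replies, best first.
pairs = ranking_to_pairs(["Reply B", "Reply A", "Reply C"])
# pairs == [("Reply B", "Reply A"), ("Reply B", "Reply C"), ("Reply A", "Reply C")]

# The loss is small when the reward model agrees with the human ranking ...
agree = pairwise_loss(2.0, -1.0)
# ... and large when it scores the rejected reply higher.
disagree = pairwise_loss(-1.0, 2.0)
```

Training the reward model then amounts to adjusting its parameters to lower this loss across all pairs.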

The clever part is amortization. Humans can't rate millions of outputs during training, but they can rate enough to teach a reward model, which then scores as many as you need. The human-labeled set is small next to what the reward model goes on to score (published systems have typically used tens of thousands of comparisons), and careful, consistent judgments matter more than raw volume.
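To make the amortization concrete, here is a toy sketch: a linear reward model is fit on three human-labeled comparisons and then reused to score replies no human ever rated. Everything here is an illustrative assumption, including the two hand-made features (completeness, refusal flag), the learning rate, and the data. Real reward models are neural networks reading the full text.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reward(weights, features):
    """Linear reward: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Fit weights by gradient descent on -log(sigmoid(r_preferred - r_rejected))."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = reward(weights, preferred) - reward(weights, rejected)
            # The step shrinks as the model grows confident in the right order.
            step = lr * (1.0 - sigmoid(diff))
            for i in range(dim):
                weights[i] += step * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per reply: (completeness in [0, 1], is_refusal flag).
labeled_pairs = [
    ((0.9, 0.0), (0.4, 0.0)),  # clear and complete beats brief
    ((0.4, 0.0), (0.0, 1.0)),  # brief beats unhelpful refusal
    ((0.9, 0.0), (0.0, 1.0)),  # clear and complete beats refusal
]
weights = train_reward_model(labeled_pairs, dim=2)

# The trained model now scores replies that no human labeled.
unlabeled = [(0.2, 0.0), (0.0, 1.0), (0.8, 0.0)]
scores = [reward(weights, x) for x in unlabeled]
best = max(range(len(unlabeled)), key=scores.__getitem__)
```

Three human judgments were enough here only because the toy model has two parameters. The point is the shape of the workflow: label once, then score without limit.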

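RLHF is where alignment got practical. It's also where reward hacking (the model exploiting quirks of the reward model instead of genuinely improving) and personality drift first showed up at scale in language models, which is why so much later work focuses on better reward signals.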