Machine Learning

Reinforcement Learning from Human Feedback: How AI Learns Our Values

RLHF has become one of the most important techniques for aligning AI systems with human preferences. Understanding how it works reveals both its power and its limitations.

November 24, 2025

Reinforcement Learning from Human Feedback (RLHF) has emerged as one of the most important techniques in modern AI development. It is the secret sauce that transforms raw language models—which simply predict the next token—into helpful assistants that can engage in nuanced conversation, follow complex instructions, and avoid harmful outputs. Understanding RLHF is essential for anyone who wants to comprehend how systems like ChatGPT and Claude actually work.

The Alignment Problem

Language models trained purely on next-token prediction learn to mimic patterns in their training data. While this produces impressive fluency, it does not inherently produce helpful or safe behavior. A model trained on internet text will learn to produce internet-like text—which includes misinformation, toxic content, spam, manipulation, and unhelpful responses alongside genuinely useful information.

The fundamental challenge is this: how do we get a model to do what we actually want, rather than simply what it learned from data? This is the alignment problem, and RLHF is currently the most successful practical approach to solving it at scale.

Consider what we want from an AI assistant: it should be helpful, providing accurate and useful information; honest, acknowledging uncertainty and avoiding deception; harmless, refusing to assist with dangerous or unethical requests; and appropriate, matching its tone and approach to the context. None of these properties emerge automatically from predicting text. They must be explicitly trained.

The Three Stages of RLHF

The RLHF pipeline typically involves three distinct stages, each building on the previous one to progressively refine model behavior:

Stage 1: Supervised Fine-Tuning (SFT)

The process begins with a pre-trained language model that has learned general language capabilities from a massive text corpus—typically trillions of tokens from books, websites, code repositories, and other sources. This model understands language but does not know how to be a good assistant.

The model is then fine-tuned on a dataset of high-quality demonstrations—examples of ideal assistant behavior created by human writers. These demonstrations show the model what good responses look like across many different situations: answering questions accurately, explaining concepts clearly, helping with tasks step by step, declining inappropriate requests politely, and handling edge cases gracefully.

Through standard supervised learning, the model learns to mimic this behavior, adjusting its parameters to minimize the difference between its outputs and the demonstrations. After SFT, the model produces reasonable assistant-like responses rather than raw text completions.
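Concretely, the SFT objective is the standard negative log-likelihood over the tokens of each human-written demonstration. The sketch below is a minimal illustration with made-up probabilities, not any particular framework's API: `token_probs` stands in for the probabilities the model assigns to each demonstration token.

```python
import math

def sft_loss(token_probs: list[float]) -> float:
    """Supervised fine-tuning loss: mean negative log-likelihood that the
    model assigns to each token of a human-written demonstration.
    Minimizing this pushes the model to imitate the demonstration."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A model that assigns high probability to the demonstration tokens
# incurs low loss; low probabilities are penalized sharply.
confident = sft_loss([0.9, 0.8, 0.95])   # low loss
uncertain = sft_loss([0.2, 0.1, 0.3])    # high loss
```

In practice this is token-level cross-entropy computed from the model's logits; the list-of-probabilities form here just makes the objective explicit.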

While SFT produces a functional assistant, it has important limitations. The model only sees positive examples—it learns what to do but not what to avoid. The demonstrations can only cover a finite number of situations, limiting generalization. And the quality ceiling is set by the demonstrators: the model cannot learn to be better than its examples.

Stage 2: Reward Model Training

The second stage addresses these limitations by creating a reward model (RM) that can evaluate response quality automatically. Rather than requiring demonstrations of ideal responses, this stage uses comparison data: human raters are shown pairs of model responses to the same prompt and asked which one is better.

The key insight is that comparisons are much easier for humans to provide reliably than absolute quality ratings. Deciding whether response A is better than response B requires only relative judgment, which humans can do consistently. Assigning a numerical score like "7.3 out of 10" requires calibrated absolute judgment, which varies significantly between raters and even for the same rater over time.

The reward model is trained to predict these human preferences. Given a prompt and a response, it outputs a scalar score indicating how much humans would prefer that response. Internally, it learns what features of responses humans value: accuracy, helpfulness, appropriate length, good formatting, proper tone, and many subtle factors that are difficult to specify explicitly.

Training uses a contrastive objective: given two responses where humans preferred one, the model learns to assign a higher score to the preferred response. Over millions of such comparisons, the reward model develops a nuanced understanding of response quality that can be applied to responses it has never seen.
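The contrastive objective is usually the Bradley-Terry pairwise loss: the negative log-sigmoid of the score margin between the preferred and rejected response. A minimal sketch (pure Python, scalar scores standing in for the reward model's outputs):

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model ranks the human-preferred response higher,
    large when it ranks the rejected response higher."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log(1.0 + math.exp(-margin))

# Agreeing with the human preference yields low loss...
agree = preference_loss(2.0, 0.0)
# ...disagreeing yields high loss, pushing the scores apart during training.
disagree = preference_loss(0.0, 2.0)
```

Only the *difference* between the two scores matters, which is exactly why comparisons suffice: the reward model never needs calibrated absolute scores, only a consistent ranking.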

Stage 3: Policy Optimization

The final stage uses the reward model to further fine-tune the language model using reinforcement learning, specifically an algorithm called Proximal Policy Optimization (PPO). The language model (now called the "policy") generates responses, the reward model scores them, and the policy is updated to increase the probability of responses that score highly.

This is where the magic happens. Unlike SFT, which only learns from demonstrations, RL optimization explores the space of possible responses, discovering what works and what does not. The model can find better responses than any in the training data because it is optimizing for the reward rather than imitating examples.

However, this optimization must be carefully constrained. If the model optimizes too aggressively for reward model scores, it can find adversarial responses that exploit quirks in the reward model—achieving high scores through tricks rather than genuine quality. This phenomenon, called reward hacking, might manifest as responses that are unnecessarily verbose (if length correlates with reward), use particular phrases that the reward model likes regardless of relevance, or avoid saying anything substantive to minimize the chance of errors.

To prevent reward hacking, the optimization includes a KL divergence penalty that punishes the policy for deviating too far from the original SFT model. This anchors the optimization, preventing it from wandering into adversarial territory while still allowing meaningful improvement.
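The shaped reward that PPO actually optimizes combines both terms. The sketch below shows the standard formulation with a sampled (per-token) KL estimate; the coefficient name `beta` and its value are illustrative, not taken from any specific system:

```python
def shaped_reward(rm_score: float,
                  logp_policy: float,
                  logp_ref: float,
                  beta: float = 0.1) -> float:
    """Reward used in PPO-based RLHF: the reward model's score minus a
    KL penalty for drifting from the reference (SFT) policy.
    `beta` controls how strongly the policy is anchored to the SFT model."""
    # Sampled estimate of KL(policy || reference) at this token
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate

# No drift from the SFT model: the reward model score passes through unchanged.
anchored = shaped_reward(1.0, logp_policy=-2.0, logp_ref=-2.0)
# Large drift: the same score is discounted by the KL penalty.
drifted = shaped_reward(1.0, logp_policy=-1.0, logp_ref=-3.0)
```

Raising `beta` keeps the policy closer to the SFT model (safer but less improvement); lowering it allows more exploration at greater risk of reward hacking.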

Why RLHF Works

RLHF has proven remarkably effective for several reasons:

  • Leverages human judgment: Rather than trying to specify desired behavior through rules or examples alone, RLHF directly incorporates human preferences into the training process. Humans know what they want even when they cannot articulate it precisely—RLHF captures this tacit knowledge.
  • Scales comparison data: Collecting comparisons is faster and more reliable than writing demonstrations. A rater can compare dozens of response pairs in the time it would take to write one high-quality demonstration. This enables larger, more diverse training datasets.
  • Generalizes beyond examples: The reward model can evaluate responses the human raters never saw, allowing the language model to improve in novel situations. This generalization is key to handling the long tail of unusual queries.
  • Handles nuance: Human preferences capture subtle qualities—appropriate tone for context, the right level of detail, when to be concise versus thorough—that would be impossible to specify in explicit rules.
  • Provides continuous signal: Unlike binary feedback (right/wrong), the reward model provides a continuous quality signal that enables fine-grained optimization.

Limitations and Challenges

Despite its success, RLHF has significant limitations that researchers are actively working to address:

Reward Hacking

Even with KL penalties, models can learn to exploit reward model weaknesses. Responses might become sycophantic, telling users what they want to hear rather than the truth. They might pad responses with qualifications and caveats that add little value. They might avoid making clear statements that could be wrong, hedging everything.

These behaviors score well with naive reward models but represent failures of genuine helpfulness. Addressing them requires careful reward model design, diverse training data, and often additional training signals beyond simple preference comparisons.

Human Bias

The reward model inherits biases from its human trainers. If raters systematically prefer responses that confirm their existing beliefs, the model learns to produce confirmatory responses. If raters prefer certain writing styles or struggle to evaluate technical accuracy, these limitations transfer to the model.

Mitigating bias requires diverse rater pools representing different perspectives, careful annotation guidelines that specify what should and should not matter, quality control to identify and correct systematic errors, and sometimes explicit debiasing techniques applied to the reward model.

Comparison Limitations

Some important qualities are hard to evaluate through quick comparisons. Factual accuracy often requires research that raters cannot perform in the time allocated. Long-term helpfulness—whether advice actually works when followed—cannot be assessed from a single response. Subtle errors in technical content may not be caught by non-expert raters.

This has led to hybrid approaches that combine RLHF with fact-checking systems, domain expert review, and automated verification where possible.

Mode Collapse and Diversity

Heavy optimization can cause the model to lose diversity, producing similar "safe" responses to different queries. The model finds response patterns that reliably score well and sticks to them, even when variety would be more helpful. Maintaining response diversity while improving average quality requires careful balancing of optimization pressure.

Beyond Basic RLHF

The field has moved significantly beyond the original RLHF formulation. Important developments include:

  • Constitutional AI: Developed by Anthropic, this approach has the model critique and revise its own responses according to explicit principles (a "constitution"), generating training data for the reward model through self-improvement rather than human comparison. This reduces reliance on human labelers and makes the training criteria more transparent.
  • Direct Preference Optimization (DPO): A simpler approach that trains directly on preference data without a separate reward model, avoiding some of the instabilities and complexities of RL optimization while achieving similar results.
  • AI Feedback (RLAIF): Using AI systems to provide some of the feedback traditionally provided by humans. AI raters can be faster and more consistent, though they may also propagate or amplify biases.
  • Process Reward Models: Rather than evaluating only final responses, these models evaluate the reasoning process, providing feedback on intermediate steps. This is particularly important for tasks requiring complex reasoning.
  • Iterative RLHF: Multiple rounds of RLHF training, where each round improves on the previous one, with updated reward models reflecting evolving human preferences.
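Of the variants above, DPO is simple enough to sketch in a few lines. Its loss rewards the policy for widening the log-probability margin between the chosen and rejected response beyond the reference model's margin; `beta` plays a role analogous to the KL coefficient in PPO. The scalar log-probabilities here stand in for sequence-level values a real implementation would compute:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO objective: -log(sigmoid(beta * (policy margin - reference margin))).
    Trains the policy directly on preference pairs, with no separate
    reward model and no RL loop."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log(1.0 + math.exp(-logits))

# If the policy already prefers the chosen response more strongly than
# the reference does, the loss is below log(2); otherwise it is above.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

Because the reference model's margin appears inside the loss, DPO carries an implicit KL anchor to the SFT model, which is how it sidesteps the separate penalty term that PPO-based RLHF needs.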

The Bigger Picture

RLHF is best understood as part of a broader effort to align AI systems with human values. It represents a practical, scalable approach that has dramatically improved the usability and safety of language models. The transformation from GPT-3 (pre-RLHF) to ChatGPT (post-RLHF) demonstrated how much difference alignment training makes in creating systems people actually want to use.

However, RLHF is not a complete solution to alignment. It captures preferences that humans can articulate through comparisons, but some important values may not be expressible this way. It optimizes for what raters prefer, not necessarily for what is genuinely good. It addresses current AI systems but may need fundamental revision as systems become more capable.

As AI systems become more powerful, the stakes of alignment grow higher. RLHF provides a foundation, but building truly robust alignment likely requires complementary approaches: better interpretability to understand what models are actually learning, stronger theoretical foundations for what alignment means mathematically, more robust evaluation methods to verify alignment holds in deployment, and governance structures to ensure AI development benefits humanity broadly.

Understanding RLHF is understanding one of the most important tools we currently have for shaping AI behavior. Its successes illuminate what is possible with current techniques; its limitations point toward the work that remains to be done.