RLHF

Reinforcement Learning from Human Feedback. A technique for aligning large language models with human preferences. Human raters rank model outputs, a reward model is trained on these rankings, and the LLM is then fine-tuned with RL (typically PPO) to maximize the reward model's score, usually with a KL penalty against the original model so the policy does not drift too far from it. Central to training ChatGPT, Claude, and other aligned LLMs.
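
A minimal sketch of the reward-model training step, assuming a toy PyTorch setup. ToyRewardModel, chosen, and rejected are hypothetical stand-ins for an LLM with a scalar value head and a tokenized human-preference pair; the point is the pairwise (Bradley-Terry style) ranking loss commonly used on the rankings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Stand-in for an LLM backbone with a scalar reward head."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)  # maps pooled features to one score

    def forward(self, token_ids):
        # Mean-pool token embeddings, then project to a single scalar reward.
        h = self.embed(token_ids).mean(dim=1)
        return self.head(h).squeeze(-1)

reward_model = ToyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Dummy preference pair: token ids for the response the rater preferred
# ("chosen") and the one ranked lower ("rejected"). Batch of 8, length 32.
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))

r_chosen = reward_model(chosen)
r_rejected = reward_model(rejected)

# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)
# pushes the preferred response's score above the rejected one's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the subsequent PPO stage, the trained reward model scores the policy's generations, and that score (typically combined with the KL penalty toward the original model) is the reward the RL update maximizes.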

ml reinforcement-learning rlhf llm alignment