RLHF and Alignment

Techniques for aligning LLM behavior with human values and preferences. Critical for making models helpful, harmless, and honest.

Key Approaches

  • RLHF — train reward model on human preferences, fine-tune with RL (PPO)
  • Constitutional AI — AI self-critique guided by principles (Anthropic)
  • DPO (Direct Preference Optimization) — optimize directly on preference pairs, no explicit reward model needed (both pairwise losses are sketched below)
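
Both RLHF reward modeling and DPO reduce to a pairwise loss over a preferred vs. rejected response. A minimal sketch of the two losses, assuming PyTorch and hypothetical per-pair tensors of rewards and log-probabilities:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss used to train an RLHF reward model: push the reward
    of the human-preferred response above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: the implicit reward is beta * (log pi_theta - log pi_ref),
    so preferences are optimized without a separate reward model or RL loop."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In RLHF the trained reward model scores rollouts during PPO fine-tuning (with a KL penalty against the reference model); DPO skips that loop and backpropagates the loss above through the policy directly.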

nlp llm rlhf alignment safety