RLHF and Alignment
Techniques for aligning LLM behavior with human values and preferences. Critical for making models helpful, harmless, and honest.
Key Approaches
- RLHF — train a reward model on human preference data, then fine-tune the policy with RL (typically PPO); reward-model loss sketched below
- Constitutional AI — the model critiques and revises its own outputs against a set of written principles (Anthropic)
- DPO (Direct Preference Optimization) — optimizes the policy directly on preference pairs, without training an explicit reward model; loss sketched below
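A minimal sketch of the pairwise loss used to train the reward model in the first RLHF stage, assuming PyTorch and batched scalar reward scores for the preferred (chosen) and rejected responses. The function and argument names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the chosen
    response above the reward of the rejected response."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The trained reward model then scores rollouts during the PPO fine-tuning stage, usually with an added KL penalty against the original model to keep the policy from drifting too far.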
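A minimal sketch of the DPO loss, assuming PyTorch and pre-computed summed log-probabilities of each response under the current policy and a frozen reference model. Names and the choice of beta are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss on a batch of preference pairs.

    The implicit reward of a response is beta * (log pi_theta - log pi_ref);
    DPO applies the Bradley-Terry objective to these implicit rewards,
    so no separate reward model or RL loop is needed."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    # -log sigmoid of the implicit reward margin, averaged over the batch
    return -F.logsigmoid(logits).mean()
```

In practice the policy and reference log-probabilities are the per-token log-probs of each response summed over its tokens; the reference model stays frozen throughout training.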
Related
- Reinforcement Learning (RL framework)
- RLHF (sub-concept in RL)
- Foundation Models (what alignment is applied to)