RLHF and Alignment

Techniques for aligning LLM behavior with human values and preferences. Critical for making models helpful, harmless, and honest.

Key Approaches

  • RLHF — train reward model on human preferences, fine-tune with RL (PPO)
  • Constitutional AI — AI self-critique guided by principles (Anthropic)
  • DPO (Direct Preference Optimization) — optimize directly on preference pairs, no explicit reward model needed (both pairwise losses are sketched below)
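
Both RLHF reward modeling and DPO reduce to a pairwise loss over a preferred vs. rejected response. A minimal sketch of the two losses, assuming PyTorch and hypothetical per-pair tensors of rewards and log-probabilities:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss used to train an RLHF reward model: push the reward
    of the human-preferred response above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: the implicit reward is beta * (log pi_theta - log pi_ref),
    so preferences are optimized without a separate reward model or RL loop."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In RLHF the trained reward model scores rollouts during PPO fine-tuning (with a KL penalty against the reference model); DPO skips that loop and backpropagates the loss above through the policy directly.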

nlp llm rlhf alignment safety