LLM Evaluation
Measuring the quality and capabilities of language models. Evaluation is challenging because LLMs are general-purpose: no single score suffices, so assessment must cover many dimensions, from next-token prediction to knowledge, summarization, and code generation.
Key Metrics and Benchmarks
- Perplexity — the exponential of the average per-token cross-entropy on held-out text; measures how well the model predicts the next token (lower is better). A short sketch follows this list.
- BLEU — n-gram precision overlap with reference translations, originally designed for machine translation quality (example after the list)
- ROUGE — recall-oriented n-gram overlap, commonly used for summarization
- Human Evaluation — the gold standard for open-ended quality, but expensive and slow to collect
- MMLU — Massive Multitask Language Understanding; multiple-choice questions spanning 57 academic and professional subjects
- HumanEval — code generation benchmark of hand-written programming problems, conventionally reported as pass@k (sketch after the list)
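As a concrete illustration of the perplexity metric, here is a minimal sketch that computes it as the exponential of the mean per-token cross-entropy. The use of the Hugging Face transformers and torch packages and the "gpt2" model are assumptions for illustration only, not part of this note.

```python
# Minimal sketch: perplexity = exp(mean per-token cross-entropy).
# Assumes `transformers` and `torch`, and a small causal LM ("gpt2"),
# purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical choice of evaluation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Language models are evaluated on held-out text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the average cross-entropy
    # over the predicted tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")
```

In practice perplexity is averaged over a full held-out corpus rather than a single sentence, and it is only comparable between models that share a tokenizer.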
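A BLEU score sketch, assuming the nltk package purely for illustration (any BLEU implementation would do); the example sentences are made up.

```python
# Minimal sketch of sentence-level BLEU: n-gram precision overlap between
# a model output and tokenized reference translations.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # tokenized reference(s)
candidate = ["the", "cat", "is", "on", "the", "mat"]     # tokenized model output

# Smoothing avoids zero scores when higher-order n-grams have no matches.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```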
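HumanEval results are reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. Below is a minimal sketch of the standard unbiased estimator, where n samples are drawn per problem and c of them pass; the numbers in the usage example are made up for illustration.

```python
# Unbiased pass@k estimator used for HumanEval-style benchmarks:
# pass@k = 1 - C(n - c, k) / C(n, k)
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n, c correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative (made-up) numbers: 200 samples per problem, 37 passing.
print(pass_at_k(n=200, c=37, k=1))   # ~0.185
print(pass_at_k(n=200, c=37, k=10))  # higher, since more attempts are allowed
```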
Related
- Classification Metrics (traditional evaluation)
- Foundation Models (what is being evaluated)