LLM Evaluation

Back to Large Language Models

Measuring the quality and capabilities of language models. This is challenging because LLMs are general-purpose, so evaluation must cover many dimensions rather than a single task score.

Key Metrics and Benchmarks

  • Perplexity — how well the model predicts the next token; the exponentiated average negative log-likelihood, lower is better (see the sketch after this list)
  • BLEU — precision-oriented n-gram overlap with reference translations (machine translation quality)
  • ROUGE — recall-oriented n-gram overlap with reference summaries (summarization quality)
  • Human Evaluation — the gold standard, but expensive and slow
  • MMLU — Massive Multitask Language Understanding; multiple-choice questions spanning 57 subjects
  • HumanEval — hand-written programming problems scored by functional correctness (pass@k)
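
A minimal sketch of the perplexity calculation referenced above, assuming the model reports a natural-log probability for each token; the function name and example values are illustrative, not from any particular library:

    import math

    def perplexity(token_logprobs):
        """Perplexity = exp(-mean log-probability) over a token sequence.

        token_logprobs: per-token log P(token_i | preceding tokens), natural log.
        Lower perplexity means the model is less surprised by the text.
        """
        if not token_logprobs:
            raise ValueError("need at least one token log-probability")
        return math.exp(-sum(token_logprobs) / len(token_logprobs))

    # Example: a model assigning every token probability 0.5 has perplexity 2.0
    print(perplexity([math.log(0.5)] * 4))  # -> 2.0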

nlp llm evaluation benchmarks