LLM-as-Judge
An evaluation technique where one language model scores the outputs of another, achieving human-level agreement on many tasks when properly calibrated.
LLM-as-Judge is an automated evaluation method that uses a language model to assess the quality, accuracy, or appropriateness of outputs from another AI system. On MT-Bench, Zheng et al. (2023) report that GPT-4 agrees with human experts 85% of the time, exceeding the 81% agreement rate between the human experts themselves. However, LLM judges exhibit systematic biases, including position bias (favoring whichever response appears first), verbosity preference (favoring longer answers), and self-enhancement bias (favoring outputs from the judge's own model family). Effective deployment therefore requires careful prompt design, explicit rubric specification, bias mitigations such as position swapping, and calibration against human labels.
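A minimal sketch of a pairwise LLM judge with position-swap debiasing: the judge is asked to compare two responses, then asked again with the positions swapped, and a winner is declared only if both verdicts agree. The prompt wording and the `call_model` callable (any function mapping a prompt string to the judge model's raw text reply, e.g. a thin wrapper around an LLM API) are assumptions for illustration, not a standard interface.

```python
import json
from typing import Callable

# Hypothetical judge prompt; real deployments tune wording and add a rubric.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare two assistant responses to the "
    "user question and decide which is better.\n"
    "Question: {question}\n"
    "Response A: {a}\n"
    "Response B: {b}\n"
    'Reply with JSON only: {{"winner": "A"}} or {{"winner": "B"}}'
)

def judge_pair(question: str, resp1: str, resp2: str,
               call_model: Callable[[str], str]) -> str:
    """Pairwise judgment with position swapping to control position bias.

    Returns "1" if resp1 wins in both orderings, "2" if resp2 does,
    and "tie" if the verdicts are inconsistent across orderings.
    """
    def one_round(a: str, b: str) -> str:
        reply = call_model(JUDGE_TEMPLATE.format(question=question, a=a, b=b))
        return json.loads(reply)["winner"]

    first = one_round(resp1, resp2)   # resp1 shown in position A
    second = one_round(resp2, resp1)  # positions swapped
    if first == "A" and second == "B":
        return "1"
    if first == "B" and second == "A":
        return "2"
    return "tie"  # inconsistent verdicts suggest position bias
```

A judge that always picks position A, for example, yields "tie" for every pair under this scheme, which is the intended behavior: its verdicts carry no signal about the responses themselves.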
Also known as
LLM judge, model-based evaluation, AI evaluator, automated evaluation