AI Benchmark

A standardized test used to measure and compare AI model performance on specific tasks, increasingly criticized for failing to predict real-world capabilities.

AI benchmarks are standardized evaluation datasets and metrics designed to measure model performance on specific tasks such as coding (HumanEval), general knowledge (MMLU), or reasoning. Though originally developed as scientific instruments for assessing capabilities, benchmarks have become susceptible to gaming through data contamination, cherry-picking, and methodology manipulation. Research shows that training-data leakage alone can inflate benchmark scores by 15-80%, and the disconnect between benchmark performance and real-world utility has become a critical problem for model selection.
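At its core, a benchmark is just a fixed evaluation set plus a scoring rule. The sketch below shows the skeleton of a minimal harness using exact-match accuracy (the metric behind many knowledge benchmarks such as MMLU); the evaluation items and the `model` function are hypothetical stand-ins, not any real dataset or API.

```python
# Minimal benchmark harness sketch: score a model's answers against a
# fixed evaluation set with case-insensitive exact-match accuracy.
# EVAL_SET and model() are toy stand-ins for illustration only.

EVAL_SET = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "H2O is commonly called?", "answer": "water"},
]

def model(question: str) -> str:
    """Toy stand-in for a real model call."""
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "unknown")

def benchmark(model_fn, eval_set) -> float:
    """Return the fraction of exact-match answers."""
    correct = sum(
        model_fn(item["question"]).strip().lower() == item["answer"].lower()
        for item in eval_set
    )
    return correct / len(eval_set)

score = benchmark(model, EVAL_SET)
print(f"accuracy: {score:.2f}")  # 2 of 3 correct -> 0.67
```

Because the evaluation set is fixed and public, a model that saw these exact items during training would score perfectly here without any genuine capability, which is precisely the contamination problem described above.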

Also known as

benchmark, LLM benchmark, model benchmark, evaluation benchmark