Data Contamination
When a model's training data overlaps with test sets, causing inflated benchmark scores that reflect memorization rather than genuine capability.
Data contamination occurs when benchmark test questions, answers, or closely related material appear in a model's training corpus. Because frontier models train on internet-scale data, they often inadvertently ingest publicly available benchmarks. The result is memorization rather than reasoning: GPT-4, for example, could solve Codeforces problems published before its training cutoff but failed on problems published afterward. Contamination can inflate benchmark scores by 15-80%, and it crosses language boundaries, since models memorize translated versions of benchmarks as well. The problem is structural rather than intentional, which makes it difficult to eliminate.
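Because contamination is hard to rule out after the fact, evaluators often screen test sets against the training corpus with overlap heuristics. Below is a minimal sketch of one such heuristic, word-level n-gram overlap (the GPT-3 paper described a 13-gram variant); the corpus, benchmark, and function names here are hypothetical, and a real pipeline would index the corpus rather than rescan it per example.

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_example: str,
                    training_docs: Iterable[str],
                    n: int = 13) -> bool:
    """Flag a test example if any of its n-grams appears verbatim in training data.

    Sketch only: a production check would precompute a corpus-wide n-gram
    index (or Bloom filter) instead of looping over documents.
    """
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return False  # example shorter than n tokens; cannot match
    for doc in training_docs:
        if test_grams & ngrams(doc, n):
            return True
    return False

# Hypothetical usage: screen a benchmark against a sample of the corpus.
corpus = ["... training documents ..."]
benchmark = ["What is the capital of France? Answer: Paris."]
flagged = [q for q in benchmark if is_contaminated(q, corpus)]
```

Exact n-gram matching catches verbatim leakage only; paraphrased or translated copies of a benchmark, as noted above, require fuzzier checks such as embedding similarity.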
Also known as
benchmark contamination, test set leakage, training data leakage