
Benchmark

A benchmark is a standardized test for comparing AI model performance. It is the scorecard published whenever a new model is announced, and it is a frequent source of controversy over models being trained to the test.

Benchmarks are standardized tests that compare multiple AI models under identical conditions. Just as students are compared on the same CSAT (Korea's college entrance exam) questions, models are scored on a shared set of problems covering areas such as math, coding, and common-sense reasoning.
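As a rough illustration of scoring several models on one shared problem set, here is a minimal sketch. Everything in it is hypothetical: the three-question `PROBLEMS` list, the toy `model_a` and `model_b` functions standing in for real models, and exact-match grading, which real benchmarks often replace with more forgiving or task-specific scoring.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shared problem set: (question, expected answer) pairs.
PROBLEMS: List[Tuple[str, str]] = [
    ("2 + 3", "5"),
    ("capital of France", "Paris"),
    ("reverse 'abc'", "cba"),
]


def score(model: Callable[[str], str], problems: List[Tuple[str, str]]) -> float:
    """Return the fraction of problems answered with an exact match."""
    correct = sum(1 for question, answer in problems if model(question).strip() == answer)
    return correct / len(problems)


def model_a(question: str) -> str:
    """Toy stand-in for a model that answers every question correctly."""
    answers = {"2 + 3": "5", "capital of France": "Paris", "reverse 'abc'": "cba"}
    return answers.get(question, "")


def model_b(question: str) -> str:
    """Toy stand-in for a weaker model: one wrong answer, one missing."""
    answers = {"2 + 3": "5", "capital of France": "Lyon"}
    return answers.get(question, "")


def leaderboard(
    models: Dict[str, Callable[[str], str]],
    problems: List[Tuple[str, str]],
) -> List[Tuple[str, float]]:
    """Score every model on the same problems and rank from best to worst."""
    scores = {name: score(fn, problems) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, value in leaderboard({"model_a": model_a, "model_b": model_b}, PROBLEMS):
        print(f"{name}: {value:.2f}")
```

The point of the sketch is that every model sees the same questions and the same grading rule, which is what makes the resulting scores comparable.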

Benchmark results are the score tables and rankings that accompany every new model announcement, and they give the industry a common yardstick for judging which model performs better. For researchers they measure technical progress; for users they are a reference when choosing a model.

However, benchmarks face persistent criticism. Two recurring problems are contamination, where test questions leak into the training data, and test-preparation training that targets benchmark scores alone. A higher score does not necessarily mean a model is more useful in real work, so benchmark results are best treated as a reference only.
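One common heuristic for spotting contamination is to check how much of a test question appears word-for-word in the training text. The sketch below assumes a simple word-level n-gram overlap; the function names, the sample strings, and the default window of 5 words are illustrative choices, not a standard.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(test_item: str, training_text: str, n: int = 5) -> float:
    """Fraction of the test item's n-grams that also occur in the training text."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # test item is shorter than n words
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


if __name__ == "__main__":
    question = "if a train leaves at noon how far does it go"
    leaked = "forum post: if a train leaves at noon how far does it go by evening"
    clean = "the weather was pleasant and the market stayed open late"
    print(overlap_ratio(question, leaked))  # high overlap suggests contamination
    print(overlap_ratio(question, clean))   # no shared 5-grams
```

A high ratio only flags a question for review. Paraphrased or translated leaks would slip past an exact-match check like this one.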

