Skip to content
Tomasus
Go back

Benchmarks: How Labs Measure Intelligence (and the Games They Play)

7 min read

A new model launches. There is a blog post and there is a chart. The chart shows bars, and the lab’s bar is taller than the other labs’ bars. The numbers are big and the conclusion is obvious: this model is the best.

Then you actually use it for a week, and the experience does not match the chart. The model is good at some things, weirdly weak at others, and the gap between the benchmark score and your actual workflow is wider than you expected.

That gap is what this article is about.

Hero: a charcoal sketch of a benchmark scoreboard with bars rising past a saturation line, while a separate real-world task remains unfinished off to the side

WHY BENCHMARKS EXIST AT ALL

A benchmark is a fixed set of problems with known correct answers, run against a model under controlled conditions. The model’s score is the percentage it gets right. That is the whole idea.

Labs need benchmarks because “this model feels smarter” is not something you can put in a research paper. You need a number, and that number has to be reproducible, comparable across models, and ideally hard enough that the best model in the world does not score 100 on it. Think of a benchmark like a standardized test for a model: it does not capture the whole student, but it lets you compare two students who never sat in the same classroom.

The good ones do real work. They tell you, roughly, whether a model can solve graduate-level physics problems, write code that passes tests, or follow long instructions without losing the thread. Without them, the AI field would be a yearly competition of vibes.

THE 2026 LINEUP

The benchmarks that matter today fall into a few buckets. MMLU tests broad knowledge across 57 academic subjects, multiple-choice. It was the headline benchmark for years and most frontier models now score above 90, which means it has stopped distinguishing them.

GPQA Diamond is harder. Graduate-level physics, chemistry, and biology questions, written so that PhD holders in adjacent fields cannot just google the answers. Frontier models are climbing it but have not pinned it.

SWE-bench Verified measures something more practical: can a model solve a real GitHub issue in a real codebase, with the patch passing the project’s actual tests? This is closer to real work than multiple choice. The leaders are in the 60-70 percent range in 2026, and that number has more weight than any MMLU score.

ARC-AGI tests visual reasoning on novel puzzles that humans solve easily and models struggle with. Its whole point is to resist memorization, the way a fresh logic puzzle resists someone who has only studied old answer keys. The 2026 generation finally cracked the public set, but the private set still humbles them.

FrontierMath is hard original mathematics, problems that take expert mathematicians hours to solve. Scores there are low and meaningful. There are more, and there will be more next quarter, because the pattern repeats: a benchmark gets created, models climb it, the benchmark stops being useful, and someone designs a harder one.

SATURATION AND CONTAMINATION

Two problems eat benchmarks from the inside.

Saturation is when models get so good at a benchmark that the score stops separating them. MMLU is the textbook case. Imagine a high jump bar that every athlete clears on the first try, the event is no longer measuring jumping ability, it is measuring rounding error.

Contamination is uglier. Models train on huge web scrapes, and benchmark questions live on the web. If the test questions and answers leaked into the training data, the model is not solving the problem, it is recalling the answer, the way a student who saw next week’s exam paper does not need to study.

The defenses are partial. Benchmark authors release private test sets that never appear online, verified benchmarks like SWE-bench Verified rerun every solution against the actual test suite, and some labs publish contamination audits. None of this is bulletproof.

When you read “we score 87 on benchmark X,” ask yourself a few things. Is the test set public? When was it released? Was it in the training data?

The lab probably will not tell you. The answer changes the meaning of the number.

Diagram: lifecycle of a benchmark from creation to saturation, with contamination as a parallel decay path

THE GAMES LABS PLAY

Here is the part nobody puts in the blog post.

Benchmarks are how labs sell models, and the incentives bend the numbers in predictable ways. Cherry-picking is the cleanest example: a model gets run on twenty benchmarks, the lab publishes the five it leads on, and the chart on launch day shows only those five. Technically true, practically misleading.

Configuration tuning is more subtle. The same model can score very differently depending on prompting, temperature, retry budget, tool access, and chain-of-thought settings. A lab will run their own model with eight retries and majority voting, and report the competitor’s score from a single attempt.

The footnote says so, in 8-point font, on page 23. Ever read those footnotes? Most people do not, and that is the whole point.

Training-on-test is the dirty version. A lab fine-tunes specifically on data that resembles the benchmark, sometimes on the benchmark itself if it has leaked. The model learns the test, not the underlying skill. Like a student who memorizes past exam questions instead of learning the subject, the model crushes one benchmark and stumbles on a similar one nobody trained for.

This is Goodhart’s law in action. When a measure becomes a target, it stops being a good measure. The 2026 benchmark scoreboard has more in common with a sales chart than with a scientific result, and reading it that way will save you disappointment.

Diagram: three ways labs game benchmarks — cherry-picking which scores to publish, tuning evaluation configuration in their favor, and training on data that resembles the test

WHAT BENCHMARKS DO NOT MEASURE

Even the honest benchmarks miss most of what makes a model useful.

They do not measure taste. A model might write technically correct code that no senior engineer would accept. They do not measure judgement: knowing when to ask a question, when to refuse, when to admit ignorance.

They do not measure long-horizon work, the kind where you spend three hours on a problem and need the model to stay coherent across the whole session. And they do not measure the experience of using the model, which is half the product.

The field knows this, and the newer evaluations are trying to catch up. Agentic benchmarks like SWE-bench Verified, OSWorld, and the various web-navigation suites measure long-horizon performance.

Vibe checks, the informal community evaluations on places like LMSYS Arena, capture preference signals that no static benchmark gets. Neither category is perfect, and both are easier to game in their own ways.

The honest answer is that no single number captures “how good is this model.” The labs would love to give you one. The reality is messier.

HOW TO READ A MODEL CARD

When the next launch happens, here is the rough mental checklist. Which benchmarks did they highlight, and which did they skip? What configuration did they run their model under, and what configuration did they run the competitors under?

When was the test set published, and were the gains concentrated on saturating benchmarks or on the harder, newer ones?

If the answers are unclear, treat the chart as marketing. Use the model for a week on your actual work. That is the only benchmark you cannot game.

The next article closes Act III with interpretability: the unfinished science of looking inside a model and understanding what it is actually doing. Benchmarks tell you what comes out. Interpretability tries to explain why.

T.

References

  1. Measuring Massive Multitask Language Understanding (Hendrycks et al., 2021) - The original MMLU paper. Useful background on what the benchmark targets and why frontier models have largely saturated it.
  2. GPQA: A Graduate-Level Google-Proof Q&A Benchmark (Rein et al., 2023) - The GPQA design paper, including how the authors made the questions resistant to web search and rote recall.
  3. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2024) - The original SWE-bench paper. SWE-bench Verified is the human-curated subset used by frontier labs in 2026.
  4. On the Measure of Intelligence (Chollet, 2019) - François Chollet’s argument that current benchmarks measure skill, not intelligence, and the framework behind ARC-AGI.
  5. Investigating Data Contamination in Modern Benchmarks for Large Language Models (Sainz et al., 2023) - Empirical study showing benchmark contamination across multiple frontier models, with practical detection methods.

Share this post on:

About Tomasz

Someone who wants to understand what is coming and how it will impact us as human beings. Writing notes on AI, cybersecurity, history, and staying sane.


Series: LLM Concepts


Related Posts


Previous Post
Interpretability: What's Actually Inside
Next Post
AI Digest W18: A Frontier Launch Week