Eval inflation: why "beats GPT-5" stopped meaning what it used to
Three of the five most-cited frontier benchmarks have had their public splits leak into training corpora since January. The score on the leaked one is not the score on the held-out one.
The strongest case for treating recent benchmark improvements skeptically is one I do not personally find convincing: that frontier labs are deliberately overfitting to public splits. It deserves room before I say why I disagree. The argument goes: any public benchmark is, by definition, in someone's training mix; therefore reported scores are systematically inflated; therefore the headline-friendly numbers of the past quarter mean less than they appear to. There is something to this, and I will come back to it.
The weaker but more interesting version is structural. Three of the five most-cited frontier benchmarks (GPQA Diamond, MMLU-Pro, and SWE-Bench Verified) have had their public splits documented in training-corpus audits within the past four months. Two of those audits were performed by the labs themselves, on competitors' models. The third (SWE-Bench Verified) was the work of a Berkeley group whose paper landed in March.
The score on the leaked split is not the score on the held-out one. It is the score on a different problem.
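To make the audit mechanics concrete, here is a minimal sketch of the kind of check such audits run: flagging eval items that share long word n-grams with a training corpus. This is an illustration only, not the method any of the audits mentioned above actually used; the function names, the window size, and the whitespace tokenization are all simplifying assumptions.

```python
def ngrams(text: str, n: int = 13) -> set:
    # Lowercased word n-grams; a window of roughly 13 words is a
    # common choice in contamination checks (assumption, not sourced
    # from the audits discussed in this article).
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(eval_items: list, corpus_docs: list, n: int = 13) -> float:
    # Fraction of eval items that share at least one n-gram with any
    # corpus document. A nonzero rate means the "held-out" split is
    # partly a memorization test, not a reasoning test.
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    if not eval_items:
        return 0.0
    flagged = sum(1 for item in eval_items if ngrams(item, n) & corpus_grams)
    return flagged / len(eval_items)
```

Real audits add normalization (punctuation stripping, tokenizer-level matching) and scale this with hashing rather than explicit sets, but the core question is the same: which fraction of the public split appears verbatim, or near-verbatim, in the training mix.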
What this means in practice is simple and frustrating. When a model card reports an MMLU-Pro number above 88, that number is no longer comparable across vendors without knowing which version of the split was used and when. The next-generation private benchmark — Mètis, the held-out evaluation suite a consortium of academic labs has been running quietly since November — is the one that matters now. As of this week, only two frontier models have been formally tested against it, and the results are under embargo until June.
Until the embargo lifts, the honest reading of every "beats GPT-5" claim from the past quarter is the same: noted, with the caveat that the comparison is between scores on three different problems, two of which are leaky. The honest version of "improved performance" is not zero. It is also not what the bar charts suggest.