Does anyone trawl these releases and cherry-pick the random metrics other companies would cherry-pick to show how amazing their models are?
There's like 8 million benchmarks. With every release, every lab picks 5-10 where its model wins on all but one, so it doesn't look like they're cherry-picking benchmarks they probably benchmaxxed for.
https://arena.ai/leaderboard - I've found this company is a pretty good ranker. I'm not sure of their exact methodology, but in day-to-day programming with Claude / GPT models, what they report matches what I've felt qualitatively.
It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.
Of the metrics they reported for 4.7, they dropped BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, and SWE-bench Verified for 4.8. The last 4 were almost always mentioned in previous Opus releases.
I would take all benchmarks with a grain of salt. I don't really use them. What's it supposed to tell me? "5% smarter", what does that mean? My experience will differ. Just try it!
I doubt Anthropic internally sets improving this or that benchmark as a goal; it's just a way to visualize progress. They probably have much more complex metrics internally.
On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?
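I don't know of a definitive one, but the grid itself is easy to sketch if you collect the numbers yourself. Here is a minimal Python version, assuming you have already gathered `(model, benchmark, score)` triples from release posts; `build_grid` and `render_grid` are hypothetical helper names, and missing entries stay visibly empty so that dropped benchmarks are easy to spot.

```python
def build_grid(results):
    """Pivot (model, benchmark, score) triples into a model x benchmark grid.

    Cells a model did not report are None, so gaps stay visible.
    """
    scores = {}
    for model, bench, score in results:
        scores[(model, bench)] = score
    models = sorted({m for m, _ in scores})
    benchmarks = sorted({b for _, b in scores})
    grid = [[scores.get((m, b)) for b in benchmarks] for m in models]
    return models, benchmarks, grid


def render_grid(models, benchmarks, grid):
    """Render the grid as aligned plain text, with '-' for unreported cells."""
    header = ["model"] + list(benchmarks)
    rows = [
        [m] + ["-" if v is None else f"{v:.1f}" for v in row]
        for m, row in zip(models, grid)
    ]
    table = [header] + rows
    widths = [max(len(r[i]) for r in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in table
    )
```

Feeding it every release's reported numbers would make it obvious which benchmarks a lab stopped reporting, since those cells show up as `-`.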
At least they show competitors in their benchmarks, unlike OpenAI, which likes to pretend it has no competitors.
Ultimately I think the only way you can trust benchmarks is if you build them yourself and keep them secret from the AI labs.
There are different levels of "cheating" on benchmarks. The worst would be literally putting them in the loss function during RL, and I assume the major labs are not cheating at that level. I am also sure they are making a genuine effort to keep the benchmark content out of the training data.
But ultimately it seems implausible that they completely abstain from benchmarking their model until they are about to release it. Even if they did, the benchmark is still part of the outermost feedback loop. So these models are all, to _some_ degree, benchmark-solving machines.
I think all we can really do is live with the model for a while and develop a subjective feeling about its quality. This shouldn't be surprising: nobody believes that coding interviews work, and we all know that you just have to work with someone to figure out if they're a good programmer. As AIs become more human-like, it's natural that they should get harder to evaluate.
This is a bit awkward, since it puts us in quite a weak position as consumers.
Maybe to some extent you can get a meaningful signal from sentiments on HN etc, but:
- There must be some amount of manipulation of this going on
- Even if it were fully organic, it's highly likely that your experience will differ materially from the median online nerd's, because AIs are bizarre things that respond in unpredictable ways to intangible things.
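For the "build them yourself and keep them secret" idea, here is a rough Python sketch of what a private benchmark could look like. Everything in it is an assumption for illustration: `run_private_eval` is a hypothetical name, `model` is any callable you wrap around whatever API you actually use, and each case is a prompt paired with your own pass/fail check.

```python
from typing import Callable, List, Tuple

# A case is a prompt plus a checker that decides whether the answer passes.
Case = Tuple[str, Callable[[str], bool]]


def run_private_eval(model: Callable[[str], str], cases: List[Case]) -> float:
    """Return the fraction of private cases the model passes.

    `model` is a hypothetical wrapper: prompt in, answer text out.
    A crash in the model call or in the checker counts as a failure.
    """
    if not cases:
        raise ValueError("need at least one case")
    passed = 0
    for prompt, check in cases:
        try:
            if check(model(prompt)):
                passed += 1
        except Exception:
            pass  # treat errors as a failed case rather than aborting the run
    return passed / len(cases)
```

The cases only stay useful as long as they never get published, and a single pass rate is still just one more number to weigh against your own day-to-day experience.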