The good old "benchmarks just keep saturating" problem.
Anthropic is genuinely one of the top companies in the field, and for a reason. Opus consistently punches above its weight, and this is only in part due to the lack of OpenAI's atrocious personality tuning.
Yes, the next stop for AI is increasing the task-length horizon and improving agentic behavior. The "raw general intelligence" component in bleeding-edge LLMs is clearly far outpacing the "executive function".
Shouldn't the next stop be to improve general accuracy, which is what these tools have struggled with since their inception? How long are "AI" companies going to keep offloading onto the user the responsibility for verifying the output of their tools?
Optimizing for benchmark scores, which are highly gamed to begin with, by throwing more resources at this problem is exceedingly tiring. Surely they must've noticed the performance plateau and diminishing returns of this approach by now, yet every new announcement is the same.