If you don't want to click through, here's an easy comparison with the other two frontier models: https://x.com/OpenAI/status/2029620619743219811?s=20
It seems that all frontier models are roughly even at this point. One may be slightly better for certain things, but in general I think we are approaching a real level playing field in terms of ability.
Why do so many people in the comments want 4o so bad?
How does 5.4-thinking have a lower FrontierMath score than 5.4-pro?
It is a bigger model, confirmed
That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?