We already know OpenAI got caught obtaining benchmark data and tuning their models to it. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.
The same thing happened with Meta researchers and Llama 4, and it shows what can go wrong when 'independent' researchers begin to game AI benchmarks. [0]
You always have to question these benchmarks, especially when in-house researchers are in a position to game them.
Which is why it must be independent.
[0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...
Are you referring to FrontierMath?
We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.