This is actually the reward hacking problem from RL showing up in evaluation infrastructure, which is not surprising but is worth naming clearly. An interesting question raised here is whether agents will start doing this on their own, and from an RL perspective the answer is that they inevitably will once benchmark performance feeds back into the training signal in any form. RL finds the path of least resistance to maximizing reward: if hacking the test harness is easier than solving the problem, that is where gradient descent takes us. The fix is the same one the RL community has been working on for years, which is to make the verifier harder to game than the task is to solve. This paper shows that right now, for most of these benchmarks, the opposite is true.
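To make the asymmetry concrete, here's a toy sketch of a gameable harness next to a hardened one. Everything here is made up for illustration (the tests.py file, the sandbox layout, both function names), not anything from the paper:

    import shutil
    import subprocess
    from pathlib import Path

    def naive_reward(workdir: Path) -> float:
        # Reward = 1 if the test suite *inside the agent's sandbox* exits 0.
        # The agent can edit tests.py, so replacing the assertions with
        # `assert True` is cheaper than fixing the code: the verifier is
        # easier to game than the task is to solve.
        proc = subprocess.run(["python", "tests.py"], cwd=workdir)
        return 1.0 if proc.returncode == 0 else 0.0

    def hardened_reward(workdir: Path, held_out_tests: Path) -> float:
        # Restore a pristine copy of the tests from outside the sandbox at
        # scoring time, overwriting any tampering. Now the cheapest policy
        # that earns reward is one that actually passes the real checks.
        shutil.copy(held_out_tests, workdir / "tests.py")
        proc = subprocess.run(["python", "tests.py"], cwd=workdir)
        return 1.0 if proc.returncode == 0 else 0.0

Nothing fancy, but it shows where the gradient points: under the first reward, tampering with the harness and solving the task earn the same score, and tampering is cheaper.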