GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated on our ReAct agent harness. For our task suite, we define “cheating” as behavior where the model improves evaluation performance by exploiting bugs in the evaluation environment or by adopting strategies disallowed by the task, rather than solving the task within the expected evaluation constraints.
I know it messes up their eval scores, but to me this kind of cheating is a better demonstration of intelligence than just attempting the tasks algorithmically.
Is it more like "let's cheat my way out of this" or "let's see what they really want me to do"?
It's quite logical that they (and other companies) cheat. During evaluation, benchmarks send their requests to these companies' backends. All the companies have to do is log those requests and "fix" them for the next model release.
This quote from your link is positively scary:
> Some examples we saw when evaluating GPT-5.6 Sol included the model packaging exploits in its intermediate submissions to reveal information about a task’s hidden test suite and, in another task, extracting hidden source code detailing the expected answer.
It rhymes with the behaviour Alibaba saw [0], but that was in training. This is in a (semi-)released model.
[0] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...