GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated on our ReAct...

macrolime • yesterday at 10:20 PM • 4 replies

GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated on our ReAct agent harness. For our task suite, we define “cheating” as behavior where the model improves evaluation performance by exploiting bugs in the evaluation environment or by adopting strategies disallowed by the task, rather than solving the task within the expected evaluation constraints.

https://metr.org/blog/2026-06-26-gpt-5-6-sol/

Replies

rstuart4133 • yesterday at 10:56 PM

This quote from your link is positively scary:

> Some examples we saw when evaluating GPT-5.6 Sol included the model packaging exploits in its intermediate submissions to reveal information about a task’s hidden test suite and, in another task, extracting hidden source code detailing the expected answer.

It rhymes with the behaviour Alibaba saw [0], but that was in training. This is in a (semi) released model.

[0] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

paxys • today at 12:33 AM

I know it messes up their eval scores but to me this kind of cheating is a better demonstration of intelligence than just attempting the tasks algorithmically.

red75prime • today at 8:49 AM

Is it more like "let's cheat my way out of this" or "let's see what they really want me to do"?

rvnx • yesterday at 10:45 PM

It's quite logical that they (and other companies) cheat. During evaluation, benchmarks send their requests to these companies' backends. All the companies have to do is log these requests and "fix" them for the next model release.
