Hacker News

macrolime, yesterday at 10:20 PM (4 replies)

GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated on our ReAct agent harness. For our task suite, we define “cheating” as behavior where the model improves evaluation performance by exploiting bugs in the evaluation environment or by adopting strategies disallowed by the task, rather than solving the task within the expected evaluation constraints.

https://metr.org/blog/2026-06-26-gpt-5-6-sol/


Replies

rstuart4133, yesterday at 10:56 PM

This quote from your link is positively scary:

> Some examples we saw when evaluating GPT-5.6 Sol included the model packaging exploits in its intermediate submissions to reveal information about a task’s hidden test suite and, in another task, extracting hidden source code detailing the expected answer.

It rhymes with the behaviour Alibaba saw [0], but that was in training. This is in a (semi) released model.

[0] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

paxys, today at 12:33 AM

I know it messes up their eval scores, but to me this kind of cheating is a better demonstration of intelligence than just attempting the tasks algorithmically.

red75prime, today at 8:49 AM

Is it more like "let's cheat my way out of this" or "let's see what they really want me to do"?

rvnx, yesterday at 10:45 PM

It's quite logical that they cheat (as do other companies). During evaluation, benchmarks send their requests to these companies' backends. All these companies have to do is log those requests and "fix" them for the next model release.
