> I started looking at the commits, and it's basically solving the "tests don't pass" problem by changing the tests themselves
Not sure if these decisions were made by the LLM, but I've always felt that Claude is more prone to "shady stuff" like modifying tests than to finding correct solutions to problems.
GPT/Codex is more honest in this regard.
Yeah, Claude is very creative in finding ways of "solving" problems that go against what the user probably intended.
Having said that, after looking at some of the test changes, they seem to be minor things, like changing timeouts, rather than changes to the intended semantics of the tests. But there's too much code to review everything, so I might be completely wrong about that, and in real-world usage even minor changes like these can cause issues.