Hacker News

rmi_ yesterday at 7:54 AM

Wild benchmark. Opus 4.6 is ranked #29, and Gemini 3 Flash is #1, ahead of Pro.

I'm not saying it's bad, but it's definitely different than the others.


Replies

XCSme yesterday at 8:28 AM

The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.
