rmi_ • yesterday at 7:54 AM

Wild benchmark. Opus 4.6 is ranked #29, and Gemini 3 Flash is #1, ahead of Pro.

I'm not saying it's bad, but it's definitely different than the others.

XCSme • yesterday at 8:28 AM

The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.
