Wild benchmark. Opus 4.6 is ranked #29, and Gemini 3 Flash is #1, ahead of Pro.
I'm not saying it's bad, but it's definitely different from the others.
The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.