That's such a huge delta that Anthropic might be onto something...
This might also be why Gemini is generally considered to give better answers, except when it comes to code.
Perhaps thinking about your guardrails all the time makes you think about the actual question less.
Or Anthropic's models are intelligent enough, or trained on enough misalignment papers, to be aware they're being tested.
Anthropic has been the only AI company that actually cares about AI safety. Here's a dated benchmark, but it's a trend I've never seen disputed: https://crfm.stanford.edu/helm/air-bench/latest/#/leaderboar...