There's no way they actually work on training for this.
The people who work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. However valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?
$200 * 1,000 = $200k/month.
I'm not saying they are, but saying with such certainty that they aren't, when money is on the line, seems like a questionable conclusion, unless you have some insider knowledge you'd like to share with the rest of the class.
I suspect they're training on this.
I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.
https://i.imgur.com/UvlEBs8.png