On the other hand, I found Claude/Opus to be extremely unhelpful when asked to benchmark itself against a possible replacement.
It will get "confused", make up numbers, and do a ton of other things, and I'm quite sure it is subtly sabotaging the process to show that there is no point in replacing it.
I mean, Opus is not perfect, but the number of "mistakes" it starts making when you ask it to benchmark itself makes me suspect they are intentional. At least in my system/harness.
You didn't add "never hallucinate or make anything up" to the prompt, rookie mistake.