Hacker News

kolinko, yesterday at 10:51 PM (1 reply)

On the other hand, I found Claude/Opus to be extremely unhelpful when asked to benchmark itself against a possible replacement.

It will get "confused", make up numbers, do a ton of other things, and I'm quite sure it is subtly sabotaging the process to show that there is no point replacing it.

I mean, Opus is not perfect, but the number of "mistakes" it starts to make when you ask it to benchmark itself makes me suspect they are intentional. At least in my system/harness.


Replies

krapp, yesterday at 11:05 PM

You didn't add "never hallucinate or make anything up" to the prompt. Rookie mistake.