That was definitely true with early LLMs, but I don't know if it's still the case. It's certainly not as strong as it used to be. I think most negative instructions are followed quite well now, but there are still a few things that must be deeply embedded from pretraining and are harder to avoid: those specific annoying phrasings, for example.
Both the pink elephant effect and the accuracy drop on negative instructions are pretty fundamental biases in both humans and LLMs. It's impossible to get rid of them entirely; you can only mitigate them to an acceptable degree. Empirically, the only way to make a model reliable at harder negative instructions is CoT, especially a self-reflection type of CoT (write a reply, verify its correctness, output a fixed version). If the native CoT fails to notice the thing that needs to be verified and you don't have a custom one or a verification loop, you're out of luck.
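The self-reflection loop mentioned above could be sketched roughly like this. This is just a minimal illustration, not anyone's production setup: `call_model` is a hypothetical stand-in for a real LLM API, stubbed here with canned behavior (it slips the banned word into the first draft and only fixes it when explicitly asked to verify), and the banned-word check is a crude placeholder for whatever verification you actually care about.

```python
# Sketch of a self-reflection / verification loop for a negative instruction
# ("never use the word X"). `call_model` is a hypothetical stub simulating a
# model that violates the ban on the first pass and corrects it on review.

BANNED = "delve"

def call_model(prompt: str) -> str:
    # Stub: a real implementation would call an actual LLM here.
    if "verify" in prompt.lower():
        # Reflection step: the "model" removes the banned word from the draft.
        draft = prompt.split("DRAFT:", 1)[1]
        return draft.replace(BANNED, "explore")
    # First pass: the "model" ignores the negative instruction.
    return f"Let's {BANNED} into the topic."

def reply_with_reflection(user_prompt: str, max_rounds: int = 2) -> str:
    draft = call_model(user_prompt)
    for _ in range(max_rounds):
        if BANNED not in draft:
            break  # verification passed; stop iterating
        # Ask the model to check the draft against the negative
        # instruction and output a corrected version.
        draft = call_model(
            f"Verify that the draft avoids the word '{BANNED}' "
            f"and fix it if needed.\nDRAFT:{draft}"
        )
    return draft

print(reply_with_reflection("Explain the topic. Never use the word 'delve'."))
```

The point of the loop is exactly the failure mode described above: if the verification step doesn't know what to look for (here, the explicit banned-word check), the draft ships with the violation intact.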