Both the pink elephant effect and the accuracy drop on negative instructions are fairly fundamental biases, for humans and LLMs alike. It's impossible to get rid of them entirely; you can only mitigate them to an acceptable degree. Empirically, the only way to make a model reliable at harder negative instructions is CoT, especially a self-reflection style of CoT (write a reply, verify its correctness, output a fixed version). If the native CoT fails to notice the thing that needs to be verified and you don't have a custom one or a verification loop, you're out of luck.
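A minimal sketch of that draft–verify–revise loop, with a hypothetical `call_model` stub standing in for a real LLM API (the stub deliberately violates the negative instruction on its first draft, just to exercise the loop):

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for an actual LLM call. The stub "forgets"
    # the negative instruction on the first draft, then complies on revision.
    if prompt.startswith("DRAFT"):
        return "Sure! Here is a pink elephant."
    if prompt.startswith("VERIFY"):
        # Verification step: check the draft against the constraint.
        return "FAIL" if "pink elephant" in prompt else "PASS"
    if prompt.startswith("REVISE"):
        return "Here is a gray mouse instead."
    return ""

def draft_verify_revise(task: str, constraint: str, max_rounds: int = 3) -> str:
    """Self-reflection loop: draft a reply, verify it against the
    negative instruction, revise until the check passes or rounds run out."""
    draft = call_model(f"DRAFT: {task}")
    for _ in range(max_rounds):
        verdict = call_model(f"VERIFY: does this violate '{constraint}'? {draft}")
        if verdict == "PASS":
            return draft
        draft = call_model(f"REVISE: fix the '{constraint}' violation in: {draft}")
    return draft  # best effort after max_rounds
```

The point is only the control flow: the verification pass is a separate prompt, so it can catch a violation the drafting pass glossed over.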