This is because LLMs don't actually understand language, they're just a "which word fragment comes next machine".
Instruction: don't think about ${term}
Now `${term}` is in the LLMs context window. Then the attention system will amply the logits related to `${term}` based on how often `${term}` appeared in chat. This is just how text gets transformed into numbers for the LLM to process. Relational structure of transformers will similarly amplify tokens related to `${term}` single that is what training is about, you said `fruit`, so `apple`, `orange`, `pear`, etc. all become more likely to get spat out.The negation of a term (do not under any circumstances do X) generally does not work unless they've received extensive training & fining tuning to ensure a specific "Do not generate X" will influence every single down stream weight (multiple times), which they often do for writing style & specific (illegal) terms. So for drafting emails or chatting, works fine.
But when you start getting into advanced technical concepts & profession specific jargon, not at all.
But they must have received this fine-tuning, right?
Otherwise it's hard to explain why they follow these negations in most cases (until they make a catastrophic mistake).
I often test this with ChatGPT with ad-hoc word games, I tell it increasingly convoluted wordplay instructions, forbid it from using certain words, make it do substitutions (sometimes quite creative, I can elaborate), etc, and it mostly complies until I very intentionally manage to trip it up.
If it was incapable of following negations, my wordplay games wouldn't work at all.
I did notice that once it trips up, the mistakes start to pile up faster and faster. Once it's made a serious mistakes, it's like the context becomes irreparably tainted.