
hugmynutus last Thursday at 10:46 PM

This is because LLMs don't actually understand language; they're just a "which word fragment comes next" machine.

    Instruction: don't think about ${term}
Now `${term}` is in the LLM's context window. The attention mechanism will amplify the logits related to `${term}` based on how often `${term}` appeared in the chat; this is just how text gets transformed into numbers for the LLM to process. The relational structure of transformers will similarly amplify tokens related to `${term}`, since that is what training is about: you said `fruit`, so `apple`, `orange`, `pear`, etc. all become more likely to get spat out.
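A toy sketch of the effect being described (this is not a real transformer; the "embeddings", token names, and similarity-as-score setup are all made-up assumptions for illustration): candidates that are semantically close to a term in the context get higher scores, and the softmax turns that into higher next-token probability.

```python
import math

# Hand-made 2-d "embeddings" for a handful of tokens (purely illustrative).
EMBED = {
    "fruit":  [1.0, 0.0],
    "apple":  [0.9, 0.1],
    "orange": [0.8, 0.2],
    "pear":   [0.85, 0.15],
    "car":    [0.0, 1.0],
    "stone":  [0.1, 0.9],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores.values())
    exps = {t: math.exp(s - m) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def next_token_probs(context):
    # Score each candidate by its summed similarity to the context tokens,
    # then normalize with a softmax -- a crude stand-in for attention.
    scores = {t: sum(dot(EMBED[c], emb) for c in context)
              for t, emb in EMBED.items()}
    return softmax(scores)

# With "fruit" in context, "apple" is more probable than with "car" in context.
p_fruit = next_token_probs(["fruit"])["apple"]
p_car = next_token_probs(["car"])["apple"]
print(p_fruit > p_car)  # True
```

The point of the sketch: merely putting `fruit` in the window raises the probability mass on fruit-related tokens, with no mechanism that asks whether you *wanted* them mentioned.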

Negating a term ("do not under any circumstances do X") generally does not work unless the model has received extensive training and fine-tuning to ensure that a specific "Do not generate X" influences every single downstream weight (multiple times), which vendors often do for writing style and certain (illegal) terms. So for drafting emails or chatting, it works fine.
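Extending the same toy model (again, every vector and token name here is an invented assumption, not how any real LLM is built): a negation word is just one more token whose embedding gets added to the mix. It does not subtract the forbidden term's influence, so in this sketch "don't mention fruit" makes `apple` *more* likely than never mentioning fruit at all.

```python
import math

# Made-up 3-d "embeddings"; the third dimension is where "not" lives,
# roughly orthogonal to the content words (purely illustrative).
EMBED = {
    "not":     [0.0, 0.0, 1.0],
    "fruit":   [1.0, 0.0, 0.0],
    "apple":   [0.9, 0.1, 0.0],
    "weather": [0.0, 1.0, 0.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores.values())
    exps = {t: math.exp(s - m) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def probs(context):
    # Same crude scoring as before: summed similarity to context tokens.
    scores = {t: sum(dot(EMBED[c], emb) for c in context)
              for t, emb in EMBED.items()}
    return softmax(scores)

p_negated = probs(["not", "fruit"])["apple"]   # "don't mention fruit"
p_absent = probs(["weather"])["apple"]         # fruit never mentioned
print(p_negated > p_absent)  # True
```

In this sketch the "not" token contributes its own similarity scores but cancels nothing; the mere presence of `fruit` in the window still pulls `apple` upward. Actually suppressing a term would require training the weights to treat the negation pattern specially, which is the fine-tuning described above.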

But once you start getting into advanced technical concepts and profession-specific jargon, it doesn't work at all.


Replies

the_af yesterday at 12:59 PM

But they must have received this fine-tuning, right?

Otherwise it's hard to explain why they follow these negations in most cases (until they make a catastrophic mistake).

I often test this with ChatGPT using ad-hoc word games: I give it increasingly convoluted wordplay instructions, forbid it from using certain words, make it do substitutions (sometimes quite creative ones; I can elaborate), etc., and it mostly complies until I very intentionally manage to trip it up.

If it was incapable of following negations, my wordplay games wouldn't work at all.

I did notice that once it trips up, the mistakes start to pile up faster and faster. Once it's made a serious mistake, it's like the context becomes irreparably tainted.