Hacker News

wizzwizz4 · yesterday at 3:40 PM · 1 reply

Language models entirely lack introspective capacity. Expecting a language model to know what size it is is a category error: you might as well expect an image classifier to know the uptime of the machine it's running on.

Language models manipulate words, not facts: to say they "lie" suggests they are capable of telling the truth, but they don't even have a notion of "truth": only "probable token sequence according to distribution inferred from training data". (And even that goes out the window after a reinforcement learning pass.)
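To make the point concrete, here's a toy sketch (a hand-written table standing in for a trained network, purely illustrative): the model's entire job is to turn a context into a probability distribution over next tokens and sample from it. Nothing in that computation ever consults whether a continuation is true.

```python
import math
import random

# Toy "language model": a hand-written table mapping a context to
# next-token logits. A real model computes these scores with a neural
# network, but the interface is the same: scores over a vocabulary.
LOGITS = {
    ("the", "sky", "is"): {"blue": 4.0, "falling": 2.0, "green": 1.0},
}

def next_token_distribution(context):
    """Softmax the logits for this context into probabilities."""
    logits = LOGITS[tuple(context)]
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

def sample_next(context, rng=random.random):
    """Sample one next token from the distribution. The model 'asserts'
    whatever is probable; truth never enters the computation."""
    dist = next_token_distribution(context)
    r, acc = rng(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point rounding

dist = next_token_distribution(["the", "sky", "is"])
# "blue" dominates, but "falling" keeps nonzero probability mass, so
# the model will sometimes emit it, and by the same mechanism.
```

The point of the sketch: a "true" continuation and a "false" one come out of the identical sampling step; the only difference is their probability mass.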

It would be more accurate to say that they're always lying, or perhaps "bluffing": sometimes those bluffs happen to be natural-language sentences that human readers interpret as describing actual states of affairs, and other times readers interpret them as describing false ones.


Replies

vessenes · yesterday at 10:06 PM

Anthropic's mechanistic interpretability group disagrees with you: in their analyses they see similar activations for 'hallucinations' and 'known lies'. The paper is pretty interesting, actually.

So you're wrong: you have a worldview about the language model that isn't backed up by hard analysis.

But I wasn't trying to make some global point about AGI; I was just noting that the hallucinations the model produced when I poked at it reminded me of model responses from before the last couple of years of RL work aimed at reducing those sorts of outputs. Hence the "unapologetic" language.
