
bluefirebrand · today at 1:02 PM

> It should be trained to answer when it knows the answer, and to state that it does not know the answer when it does not

Do LLMs even have any kind of internal model of what they know or don't know? My understanding is that they don't.


Replies

Lerc · today at 5:52 PM

There has been quite a lot of work in this area. Analysis of activations around hallucinations seems to show that there is some representation of not knowing.

There are things like https://arxiv.org/abs/2410.22071v2
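For a concrete sense of what that kind of analysis looks like, here is a minimal sketch of a linear probe on hidden activations (my own illustration, not that paper's code; the activation/label files and the layer choice are hypothetical):

    # Fit a linear probe on hidden-state activations to predict whether the
    # model's answer on each prompt was hallucinated.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical data: one activation vector per prompt (e.g. the last-token
    # hidden state from a middle layer), plus a 0/1 hallucination label.
    acts = np.load("activations.npy")      # shape: (n_prompts, hidden_dim)
    labels = np.load("hallucinated.npy")   # shape: (n_prompts,)

    X_train, X_test, y_train, y_test = train_test_split(acts, labels, test_size=0.2)

    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)

    # Held-out accuracy well above chance suggests the activations carry a
    # linearly decodable signal about whether the model "knows" the answer.
    print("probe accuracy:", probe.score(X_test, y_test))

If a probe that simple picks up the signal reliably, that's at least evidence the "I don't know" state is represented somewhere, whatever the model then does with it.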

But again, things are not quite so simple. Detecting hallucinations might surface representations where the model knew the answer but elected to hallucinate anyway because of some other obscure interaction.

Anthropic's work on autoencoding activations for analysis has revealed a lot about the internal semantic representations of these models. I haven't seen much there on the bounds of a model's knowledge, but I wonder if that's something they hold back for competitive advantage.
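Roughly, the idea is to train a sparse autoencoder on residual-stream activations so each activation decomposes into a small number of more interpretable features. A toy sketch of that setup (sizes and the L1 penalty are placeholders, not Anthropic's actual configuration):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=4096, d_features=32768):
            super().__init__()
            # Overcomplete dictionary: many more features than model dimensions.
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, acts):
            features = torch.relu(self.encoder(acts))  # sparse feature activations
            recon = self.decoder(features)             # reconstruction of the input
            return recon, features

    def sae_loss(acts, recon, features, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty that pushes each activation
        # to be explained by only a few active features.
        return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

Once trained, you look at which inputs light up each feature; in principle a feature tracking "outside my knowledge" could show up the same way, which is the part I haven't seen published.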