Hacker News

Lerc, today at 5:52 PM

There has been quite a lot of work in this area. Analysis of activations around hallucinations seems to show that models have some internal representation of not knowing.
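A common way that kind of analysis is done is with a linear probe on hidden activations. Here's a minimal numpy sketch on synthetic data: everything is made up for illustration (the "activations", the labels, and the planted uncertainty direction), but the difference-of-means probe itself is the standard technique.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden-state activations: d-dimensional vectors
# from prompts where the model answered correctly (label 0) vs.
# hallucinated (label 1).  In real work these would be captured from a
# transformer's residual stream; here we plant a hypothetical
# "uncertainty" direction into the hallucinated examples.
d, n = 64, 500
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + labels[:, None] * direction

# Difference-of-means linear probe: the vector between the two class
# means gives a direction along which the classes separate.
w = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
scores = acts @ w
preds = (scores > scores.mean()).astype(int)
accuracy = (preds == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

If a probe like this classifies well on held-out data, that's evidence the model linearly represents the property being probed for, though it says nothing about whether the model uses that representation.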

There are things like https://arxiv.org/abs/2410.22071v2

But again, things are not quite so simple. Detecting hallucinations might surface cases where the model knew the answer but elected to hallucinate anyway because of some other obscure interaction.

Anthropic's work on autoencoding activations for analysis has yielded a lot of information about the internal semantic features of models. I haven't seen much there on the bounds of a model's knowledge, but I wonder if that's something they hold back for competitive advantage.
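The core of that approach is training a sparse autoencoder on activations so that each learned dictionary element hopefully corresponds to an interpretable feature. A toy numpy sketch, with fabricated data (sparse combinations of a few random "feature" directions) and plain full-batch gradient descent rather than anything Anthropic actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: activations built as sparse combinations of a few
# ground-truth "feature" directions, mimicking the superposition picture.
d, n_feats, n = 32, 8, 2000
features = rng.normal(size=(n_feats, d))
mask = rng.random((n, n_feats)) < 0.2          # each feature active ~20% of the time
acts = (rng.random((n, n_feats)) * mask) @ features

# One-hidden-layer sparse autoencoder: ReLU encoder, linear decoder,
# trained on MSE reconstruction loss plus an L1 sparsity penalty.
h = 16
W_enc = rng.normal(scale=0.1, size=(d, h))
b_enc = np.zeros(h)
W_dec = rng.normal(scale=0.1, size=(h, d))
lr, l1 = 0.01, 1e-3
mses = []

for step in range(2000):
    z = np.maximum(acts @ W_enc + b_enc, 0.0)  # sparse codes
    err = z @ W_dec - acts                     # reconstruction error
    mses.append(float((err ** 2).mean()))
    # backprop through the decoder and the ReLU, with the L1 subgradient
    g_z = (err @ W_dec.T + l1 * np.sign(z)) * (z > 0)
    W_dec -= lr * (z.T @ err) / n
    W_enc -= lr * (acts.T @ g_z) / n
    b_enc -= lr * g_z.mean(axis=0)

print(f"MSE: {mses[0]:.3f} -> {mses[-1]:.3f}, "
      f"fraction of active codes: {(z > 0).mean():.2f}")
```

The interpretability payoff comes afterwards: you inspect which inputs most strongly activate each code dimension and ask whether it tracks a human-recognizable concept.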