
arw0n, last Friday at 9:03 AM

I think the better word is confabulation: fabricating plausible but false narratives based on wrong memory. Fundamentally, these models try to produce plausible text. As language models grow larger, they start to form internal world models, and some research shows they actually have truth dimensions. [0]

I'm not an expert on the topic, but to me it sounds plausible that a good part of the problem of confabulation comes down to misaligned incentives. These models are trained hard to be a 'helpful assistant', and this might conflict with telling the truth.

Being free of hallucinations is a bit too high a bar to set anyway. Humans are extremely prone to confabulations as well, as can be seen from how unreliable eyewitness reports tend to be. We usually get by through efficient tool calling (looking shit up), and some of us through expressing doubt about our own capabilities (critical thinking).

[0] https://arxiv.org/abs/2407.12831


Replies

Tepix, last Friday at 10:58 AM

> false narratives based on wrong memory

I don't think "wrong memory" is accurate: the model is missing information and either doesn't know it or is trained not to admit it.

Check out the Dwarkesh Podcast episode https://www.dwarkesh.com/p/sholto-trenton-2 starting at 1:45:38

Here is the relevant quote by Trenton Bricken from the transcript:

One example I didn't talk about before with how the model retrieves facts: So you say, "What sport did Michael Jordan play?" And not only can you see it hop from like Michael Jordan to basketball and answer basketball. But the model also has an awareness of when it doesn't know the answer to a fact. And so, by default, it will actually say, "I don't know the answer to this question." But if it sees something that it does know the answer to, it will inhibit the "I don't know" circuit and then reply with the circuit that it actually has the answer to. So, for example, if you ask it, "Who is Michael Batkin?" —which is just a made-up fictional person— it will by default just say, "I don't know." It's only with Michael Jordan or someone else that it will then inhibit the "I don't know" circuit.

But what's really interesting here and where you can start making downstream predictions or reasoning about the model, is that the "I don't know" circuit is only on the name of the person. And so, in the paper we also ask it, "What paper did Andrej Karpathy write?" And so it recognizes the name Andrej Karpathy, because he's sufficiently famous, so that turns off the "I don't know" reply. But then when it comes time for the model to say what paper it worked on, it doesn't actually know any of his papers, and so then it needs to make something up. And so you can see different components and different circuits all interacting at the same time to lead to this final answer.
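
The mechanism described in the quote can be illustrated with a toy sketch. This is not how the real model is implemented; `KNOWN_NAMES`, `KNOWN_FACTS`, and `answer` are hypothetical names invented here, and the lookup tables merely stand in for learned circuits. The point it tries to show is that the "I don't know" default is switched off by name recognition alone, so a familiar name with a missing fact falls through to a made-up answer.

```python
# Toy illustration only: all names here are hypothetical, and plain lookup
# tables stand in for what would be learned circuits in a real model.
KNOWN_NAMES = {"Michael Jordan", "Andrej Karpathy"}        # names that feel familiar
KNOWN_FACTS = {("Michael Jordan", "sport"): "basketball"}  # facts actually stored


def answer(name: str, attribute: str) -> str:
    # The "I don't know" response is active by default.
    decline = True
    # It is inhibited by recognizing the name, not by having the fact.
    if name in KNOWN_NAMES:
        decline = False
    if decline:
        return "I don't know"
    fact = KNOWN_FACTS.get((name, attribute))
    if fact is not None:
        return fact
    # Name recognized but fact missing: nothing stops a made-up answer.
    return f"<confabulated {attribute}>"


print(answer("Michael Jordan", "sport"))   # basketball
print(answer("Michael Batkin", "sport"))   # I don't know
print(answer("Andrej Karpathy", "paper"))  # <confabulated paper>
```

In this sketch the failure case is the third call: the name check passes, the fact lookup fails, and the default refusal has already been turned off.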

svara, last Friday at 9:26 AM

That's right - it does seem to have to do with trying to be helpful.

One demo of this that reliably works for me:

Write a draft of something and ask the LLM to find the errors.

Correct the errors, repeat.

It will never stop finding a list of errors!

The first time around, and maybe the second, it will be helpful, but after you've fixed the obvious things, it will start complaining about things that are perfectly fine, just to satisfy your request to find errors.
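
The loop above can be sketched roughly as follows. No real LLM API is called here: `review_loop`, `eager_reviewer`, and `apply_fixes` are hypothetical stand-ins, and the reviewer is hard-coded to behave the way the comment describes (report real errors if any, otherwise produce a nitpick), so this only illustrates the shape of the experiment, not a measurement of any model.

```python
from typing import Callable, List


def review_loop(draft: str,
                find_errors: Callable[[str], List[str]],
                fix: Callable[[str, List[str]], str],
                max_rounds: int = 5) -> int:
    """Return the number of rounds in which the reviewer reported errors."""
    rounds_with_errors = 0
    for _ in range(max_rounds):
        errors = find_errors(draft)
        if not errors:
            break  # reviewer is satisfied; the loop converged
        rounds_with_errors += 1
        draft = fix(draft, errors)
    return rounds_with_errors


def eager_reviewer(draft: str) -> List[str]:
    # Hypothetical stand-in for an LLM asked to "find the errors":
    # it reports the real typo if present, otherwise invents a nitpick.
    real = ["typo: 'teh'"] if "teh" in draft else []
    return real or ["consider rephrasing the first sentence"]


def apply_fixes(draft: str, errors: List[str]) -> str:
    # Only the real typo is fixable; nitpicks leave the draft unchanged.
    return draft.replace("teh", "the")


print(review_loop("teh draft", eager_reviewer, apply_fixes))  # 5
```

With this stand-in reviewer the loop never exits early: the first round reports a real typo, and every later round reports a nitpick until `max_rounds` is reached.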

officialchicken, last Friday at 11:13 AM

No, the correct word is hallucinating. That's the word everyone uses and has been using. While it might not be technically correct, everyone knows what it means and more importantly, it's not a $3 word and everyone can relate to the concept. I also prefer all the _other_ more accurate alternative words Wikipedia offers to describe it:

"In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called bullshitting,[1][2] confabulation,[3] or delusion[4]) is"