Claude doesn't know why it acted the way it did; it is only predicting why it acted. I see people falling for this trap all the time.
That's because when the failure becomes the context, it can clearly express the intent not to fall for it again. However, when the original problem is the context, none of that obviousness applies.
Very typical, and it gives LLMs that annoying Captain Hindsight-like behaviour.
Yes, this pitfall is a hard one. It is very easy to interpret the LLM in a way for which there is no real basis.
It's not even predicting why it acted; it's predicting an explanation of why it acted, which is even worse, since there's no consistent mental model behind it.
It’s not even doing that. It’s just an algorithm for predicting the next word. It doesn’t have emotions and doesn’t actually think. So I had to chuckle when it said it was arrogant. Basically, its training data contains a bunch of post-mortem write-ups, and it’s using those as a template for what text to generate, telling us what we want to hear.
IDK how far AIs are from intelligence, but they are close enough that there is no room for anthropomorphizing them: when they are anthropomorphized, it's assumed to be a misunderstanding of how they work.
Whereas someone might say "geez, my computer really hates me today" if it's slow to start, and we wouldn't feel the need to explain that the computer cannot actually feel hatred. We understand the analogy.
I mean, your distinction is totally valid, and I don't blame you for making it, because I think there is a huge misunderstanding. But when I have the same thought, it often occurs to me that people aren't necessarily speaking literally.
It has been shown that LLMs don't know how they work. Researchers asked an LLM to perform computations and to explain how it got to the result. The LLM's explanation is typical of how we do it: add the numbers digit by digit, with carries, etc. But looking inside the neural network shows that the reality is completely different and much messier. None of this is surprising.
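For reference, here is a minimal Python sketch of the schoolbook digit-by-digit, carry-based procedure the LLM describes in its explanation; the point of the interpretability work is that the model's internal computation looks nothing like this.

```python
# Schoolbook addition: the procedure the LLM *claims* to follow when asked
# to explain its arithmetic, working digit by digit with a carry.
def add_digit_by_digit(a: int, b: int) -> int:
    xs = [int(d) for d in reversed(str(a))]  # least significant digit first
    ys = [int(d) for d in reversed(str(b))]
    digits, carry = [], 0
    for i in range(max(len(xs), len(ys))):
        x = xs[i] if i < len(xs) else 0
        y = ys[i] if i < len(ys) else 0
        total = x + y + carry
        digits.append(total % 10)  # digit to write down
        carry = total // 10        # digit to carry over
    if carry:
        digits.append(carry)
    return int("".join(str(d) for d in reversed(digits)))

assert add_digit_by_digit(457, 689) == 1146
```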
Still, feeding it back its own completely made-up self-reflection could be an effective strategy; reasoning models kind of work like this.
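A minimal sketch of that idea, assuming a hypothetical chat(messages) helper standing in for whatever model API you use: the model's own (possibly confabulated) self-critique is simply appended to the context before the next attempt.

```python
# Hypothetical reflection loop: feed the model's own self-critique back into
# its context, whether or not that critique reflects what actually happened.
def chat(messages: list[dict]) -> str:
    # Stand-in for a real LLM API call; plug in your client of choice.
    raise NotImplementedError

def reflect_and_retry(task: str, rounds: int = 2) -> str:
    messages = [{"role": "user", "content": task}]
    answer = chat(messages)
    for _ in range(rounds):
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content":
                         "Explain what might be wrong with your previous answer, "
                         "then give a corrected answer."})
        answer = chat(messages)
    return answer
```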