Hacker News

vessenes · yesterday at 10:06 PM

Anthropic's mechanistic interpretability group disagrees with you: in their analyses, they see similar activations for 'hallucinations' and for 'known lies'. The paper is actually pretty interesting.

So you're wrong - you have a worldview about the language model that isn't backed up by hard analysis.

But I wasn't trying to make some global point about AGI; I was just noting that the hallucinations the model produced when I poked at it reminded me of model responses from before the last couple of years of RL work aimed at reducing these sorts of outputs. Hence the "unapologetic" language.


Replies

wizzwizz4 · yesterday at 11:15 PM

Which paper? I've read all the titles and looked at a few from the past year, but it's not obvious which one you're referring to.

I did also, accidentally, find some "I tried the obvious thing and the results challenge the paper's narrative" criticism of one of Anthropic's recent papers: https://www.greaterwrong.com/posts/kfgmHvxcTbav9gnxe/introsp.... That has significantly reduced my overall trust in this research team's interpretation of their own results – specifically, their assertions of the form "there must exist".

(Several people in the comments there claim to have designed their own experiments that replicate Anthropic's claims, but none of the ones I've looked at actually do: they have even more obvious flaws. For example, arXiv:2602.11358 is indistinguishable from "the prompt says to tell a first-person story about an AI system gaining sentience after being given a special prompt, and homonyms are represented differently within a model".)
