Hacker News

jychang, last Friday at 11:12 AM

> I think nobody understands what's happening in these models

Quick question, do you know what "Mechanistic Interpretability Researcher" means? Because that would be a fairly bold statement if you were aware of that. Try skimming through this first: https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-ex...

> On the macro level, everyone can see simple logical flaws.

Your argument applies to humans as well. Or are you saying humans can't possibly understand bugs in code because they make simple logical flaws as well? Does that mean the existence of the Monty Hall Problem shows that humans cannot actually do math or logical reasoning?


Replies

wamiks, yesterday at 12:00 PM

Thanks for the link. Yeah, interesting and creative work. I can see how it can help reason about large models, though "interpret" seems more aspirational than real; it's still largely narrative driven. I've been waiting for something deep in this area, and I'm not sure whether it will come from this community or not. For sure, as of today, the bold claim is that someone understands.

> Your argument applies to humans as well

Yeah, I'm talking about obvious and trivial errors that reveal a lack of any real representation of the code. But your question did make me think, cheers.

dns_snek, last Friday at 2:38 PM

> do you know what "Mechanistic Interpretability Researcher" means? Because that would be a fairly bold statement if you were aware of that.

The mere existence of a research field is not proof of anything except "some people are interested in this". It certainly doesn't imply that anyone truly understands how LLMs process information, "think", or "reason".

As with all research, people have questions, ideas, theories and some of them will be right but most of them are bound to be wrong.
