Hacker News

BoredPositron, last Friday at 11:41 AM

Architecture-wise, the "admit" part is impossible.


Replies

rbranson, last Friday at 1:24 PM

Bricken isn’t just making this up. He’s one of the leading researchers in model interpretability. See: https://arxiv.org/abs/2411.14257

Tepix, last Friday at 1:39 PM

Why do you think it's impossible? I just quoted him saying 'by default, it will actually say, "I don't know the answer to this question"'.

We already see that, given the right prompting, we can get LLMs to admit more often that they don't know things.
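
To make the prompting point concrete, here is a minimal sketch of nudging a model toward admitting uncertainty with a system instruction. It uses the OpenAI Python client; the model name, prompt wording, and test question are illustrative placeholders of my own, not anything from the thread or from Bricken's work.

    # Minimal sketch: steer a chat model toward saying "I don't know"
    # instead of guessing. Model name and prompt text are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "If you are not confident your answer is correct, reply exactly "
        "with 'I don't know' rather than guessing."
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "What was the exact population of Uruk in 3000 BC?"},
        ],
    )

    print(response.choices[0].message.content)

How often the model actually takes the "I don't know" option still depends on the model and the question; the system prompt only shifts the default behaviour the quote describes.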