robertk yesterday at 8:37 PM

You don't know what you are talking about. Obviously refusal circuitry does not live in one layer, but the repo is built on a paper with sound foundations from an Anthropic scholar working with a DeepMind interpretability mentor: https://scholar.google.com/citations?view_op=view_citation&h...