Hacker News

rao-v · yesterday at 3:13 PM · 0 replies

+1 this does seem to be a genuine attempt to actually build an interpretable model, so nice work!

Having said that, I worry that you run into Illusion of Consciousness issues, where the model changes its attribution from “sandbagging” to “unctuous” when you control its response, because the response is generated outside of the attribution modules (I don’t quite understand how cleanly everything flows through the concept modules and the residual). Either way, this is a sophisticated problem to have. Would love to see whether this can be trained to parity with modern 8B models.