
creatonez yesterday at 8:08 PM

> For example, it focuses a lot on doing "ablation studies", by which it means removing random layers of an already-trained model, to find the source of the refusals(?), which is an absolute fool's errand because such behavior is trained into the model as a whole and would not be found in any particular layer.

That doesn't mean there couldn't be a "concept neuron" that is doing the vast majority of heavy lifting for content refusal, though.


Replies

mapontosevenths today at 12:42 AM

That's not what it means at all. It uses SVD[0] to map the subspace in which the refusal happens. It's all pretty standard stuff with some hype on top to make it an interesting read.

It's basically using a compression technique to figure out which logits are the relevant ones and then zeroing them.

[0] https://en.wikipedia.org/wiki/Singular_value_decomposition
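
To make the SVD idea concrete, here is a rough sketch of what "map the subspace, then zero it" could look like. It is not the code of the tool being discussed: the function names `refusal_subspace` and `ablate` are made up for illustration, the activations are synthetic random data with one planted direction, and it assumes the subspace is estimated from differences between paired "refused" and "complied" activations.

```python
import numpy as np

def refusal_subspace(refused_acts, complied_acts, k=1):
    """Return the top-k right singular vectors of the activation differences.

    Hypothetical helper: rows of the result are orthonormal directions
    that capture most of the refused-vs-complied difference.
    """
    diffs = refused_acts - complied_acts            # shape (n_prompts, d_model)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                                   # shape (k, d_model)

def ablate(acts, basis):
    """Subtract from each activation its projection onto the subspace."""
    return acts - (acts @ basis.T) @ basis

# Synthetic stand-in for real model activations (assumption, not real data):
# every "refused" activation is its "complied" twin plus one shared direction.
rng = np.random.default_rng(0)
d_model, n_prompts = 16, 200
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

complied = rng.normal(size=(n_prompts, d_model))
noise = 0.1 * rng.normal(size=(n_prompts, d_model))
refused = complied + 5.0 * direction + noise

basis = refusal_subspace(refused, complied, k=1)
ablated = ablate(refused, basis)

# The recovered direction should line up with the planted one (up to sign),
# and the ablated activations should have no component left along it.
alignment = abs(float(basis[0] @ direction))
residual = float(np.abs(ablated @ basis.T).max())
```

On this toy data the top singular vector recovers the planted direction almost exactly, which is the sense in which SVD acts as a "compression technique" here: it finds the few directions that explain most of the difference, rather than a single neuron or layer.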
