logoalt Hacker News

akerstenyesterday at 2:44 PM4 repliesview on HN

2024 which is ancient history. This is not true anymore, the models now are trained to prevent abliteration by spreading out the refusal encoding

See https://arxiv.org/abs/2505.19056


Replies

cgearhartyesterday at 6:42 PM

Spreading out the refusal encoding shouldn’t be effective as a countermeasure. Even if it were smeared across the vector space, as long as it’s in a subspace that doesn’t span the entire domain then you should be able to either null out the entire subspace spanned by the refusals or run some kind of clustering on the generated samples to identify the dominant directions and nullify all of them. I think an effective defense would either need to spread them to span the entire domain—basically “encrypting” the refusal so it can hide anywhere, or you’d need a very large number of independent refusal circuits in the model so that simple hacks in the vectors themselves don’t matter, or maybe you could make other circuits depend on proper functioning of the refusal circuits… hmmm… is that along the lines of what you’re saying they’ve done already? (Any references or links to modern techniques?)

bastawhizyesterday at 10:33 PM

And the research you're linking is also out of date. SOTA abliteration was published a month later:

https://huggingface.co/blog/grimjim/norm-preserving-biprojec...

0xkvybyesterday at 5:32 PM

Still crazy how easy it is to "jailbreak" even SOTA LLMs with a simple assistantResponse replacement in chat thread.

show 1 reply
Der_Einzigeyesterday at 3:26 PM

That doesn't stop/prevent abliteration. The creator of XTC/DRY is also a chad who makes sure that you really can access the full model capabilities. Censorship is the devil.

https://github.com/p-e-w/heretic

show 3 replies