2024 which is ancient history. This is not true anymore, the models now are trained to prevent ablit...

akersten • yesterday at 2:44 PM • 4 replies • view on HN

2024 which is ancient history. This is not true anymore, the models now are trained to prevent abliteration by spreading out the refusal encoding

See https://arxiv.org/abs/2505.19056

Replies

cgearhart • yesterday at 6:42 PM

Spreading out the refusal encoding shouldn’t be effective as a countermeasure. Even if it were smeared across the vector space, as long as it’s in a subspace that doesn’t span the entire domain then you should be able to either null out the entire subspace spanned by the refusals or run some kind of clustering on the generated samples to identify the dominant directions and nullify all of them. I think an effective defense would either need to spread them to span the entire domain—basically “encrypting” the refusal so it can hide anywhere, or you’d need a very large number of independent refusal circuits in the model so that simple hacks in the vectors themselves don’t matter, or maybe you could make other circuits depend on proper functioning of the refusal circuits… hmmm… is that along the lines of what you’re saying they’ve done already? (Any references or links to modern techniques?)

bastawhiz • yesterday at 10:33 PM

And the research you're linking is also out of date. SOTA abliteration was published a month later:

https://huggingface.co/blog/grimjim/norm-preserving-biprojec...

0xkvyb • yesterday at 5:32 PM

Still crazy how easy it is to "jailbreak" even SOTA LLMs with a simple assistantResponse replacement in chat thread.

➕ show 1 reply

Der_Einzige • yesterday at 3:26 PM

That doesn't stop/prevent abliteration. The creator of XTC/DRY is also a chad who makes sure that you really can access the full model capabilities. Censorship is the devil.

https://github.com/p-e-w/heretic

➕ show 3 replies

alt Hacker News

Replies