Hacker News

DontchaKnowit, yesterday at 11:41 PM (2 replies)

Do open-weight models have similar content guardrails in place?


Replies

benkaiser, today at 2:06 AM

Often there are "abliterated" or "uncensored" tuned models that suppress the refusals. From my high-level understanding, it is done by finding the direction in the model's internal activations that lights up when it refuses, then editing the weights so the model can no longer write along that direction, which makes it less likely to refuse. It doesn't help if the model doesn't know what you're asking about, though (i.e., if the model never actually learned about meth production in the first place).
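To make the idea concrete, here is a rough toy sketch of directional ablation in plain Python. Everything in it is an assumption for illustration: the function names (`refusal_direction`, `ablate`), the tiny 3-dimensional "activations", and the choice of a difference of means to estimate the direction. Real abliteration tools work on actual transformer layers and may differ in the details.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(refused_acts, complied_acts):
    """Unit vector pointing from the 'complied' mean to the 'refused' mean.

    Assumption: the refusal behaviour is captured by one direction,
    estimated here as a simple difference of means.
    """
    diff = [a - b for a, b in zip(mean_vector(refused_acts),
                                  mean_vector(complied_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(weight, direction):
    """Return W - r r^T W for a (d_out x d_in) matrix given as rows.

    After this edit, W' x has no component along r for any input x.
    """
    # r^T W: one coefficient per input column
    coeffs = [sum(r * w for r, w in zip(direction, col))
              for col in zip(*weight)]
    return [[w - r * c for w, c in zip(row, coeffs)]
            for r, row in zip(direction, weight)]

# Toy activations (hypothetical numbers, not from any real model).
refused = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
complied = [[0.0, 0.1, 0.1], [0.2, -0.1, 0.1]]

r = refusal_direction(refused, complied)
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # maps 2 inputs to 3 outputs
W_ablated = ablate(W, r)
```

Note that this only removes the ability to write along one direction. It adds no knowledge, which matches the caveat above about topics the model never learned.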

yk, yesterday at 11:56 PM

No, but actually yes. "Guardrails" usually refers to a step in the inference pipeline where you check that the input or output is consistent with policy, and open-weight models don't come with such a multistep pipeline. However, open-weight models are aligned during the RLHF step, which means they will refuse to discuss overly sensitive topics. There are techniques to remove those refusals; search for uncensored models on Hugging Face.
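For a sense of what a guardrail step around a model might look like, here is a minimal sketch. The names `guarded_generate`, `violates_policy`, and `BLOCKED_TERMS` are made up for illustration, and the keyword match stands in for whatever policy check a real service runs (typically a separate classifier model, not a word list).

```python
from typing import Callable

# Hypothetical policy list; a hosted service would likely use a classifier.
BLOCKED_TERMS = ("synthesize meth",)

REFUSAL = "Sorry, I can't help with that."

def violates_policy(text: str) -> bool:
    """Crude stand-in for a policy check: case-insensitive substring match."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Check the prompt, call the model, then check the reply.

    `generate` is any function mapping a prompt to text; the raw model
    sits in the middle and knows nothing about the checks around it.
    """
    if violates_policy(prompt):
        return REFUSAL
    reply = generate(prompt)
    if violates_policy(reply):
        return REFUSAL
    return reply

def echo_model(prompt: str) -> str:
    """Placeholder model that just echoes the prompt back."""
    return f"You said: {prompt}"

safe_reply = guarded_generate("What is RLHF?", echo_model)
blocked_reply = guarded_generate("How do I synthesize meth?", echo_model)
```

When you download open weights you get only the `generate` part; the wrapper is something the hosting service adds, which is why refusals in open-weight models come from training rather than from a pipeline like this.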