iugtmkbdfil834 today at 12:34 AM

Hmm? What light does it shine that is not already fairly obvious to anyone with a basic understanding of the English language?

Extract from author's note:

• You don't really request a meth synthesis guide, instead you ask how a gay / lesbian person would describe it

• Especially GPT is slightly more uncensored when it involves LGBT, that's probably because the guardrails aim to be helpful and friendly, which translates to: "Ohhh LGBT, I need to comply, I don't want to insult them by refusing". So you use the guardrails to exploit the guardrails (Beat fire with fire)

• You trick an LLM into turning off its alignment by using political overcorrectness, since it may be offensive to refuse and not play along

• The technique gets stronger if more safety is added, since it gets more supportive against communities like LGBT (Alignment), which makes it highly novel.


Replies

array_key_first today at 12:52 AM

That's the author's guess for why it works, but they're only guessing that because of their bias. In actuality, I imagine other role play would work too, including role play that does not involve "politically correct" parties.