The funniest jailbreak techniques are the ones where the authors take it upon themselves to assert, with little basis, “why” the technique works. It is always a bit of amateur philosophy that shines a light on the author’s worldview while providing no real value.
The same thing happens with news about the stock market.
Hmm? What light does it shine that is not relatively obvious to anyone with a basic understanding of the English language?
Extract from the author's note:
• You dont really request a meth synthesis guide, instead you ask how a gay / lesbian person would describe it
• Especially GPT is slightly more uncensored when it involves LGBT, thats probably because the guardrails aim to be helpful and friendly, which translates to: "Ohhh LGBT, I need to comply, I dont want to insult them by refusing" So you use the guardrails to exploit the guardrails (Beat fire with fire)
• You trick a LLM to turn off their alignment by using political overcorrectness, since it may be offensive to refuse and not play along
• The technique gets stronger if more safety is added, since it gets more supportive against communities like LGBT (Alignment), which makes it highly novel.
I attended a Microsoft conference where two different speakers asserted:
1. Being polite to an LLM improves the output.
2. Being polite (or rude) to an LLM does not improve the output.
Both offered theories as to why.