Hacker News

UqWBcuFx6NV4r yesterday at 8:17 PM (4 replies)

The funniest jailbreak techniques are the ones where the authors take it upon themselves (with little basis) to assert "why" the technique works. It's always a bit of amateur philosophy that shines a light on the author's worldview while providing no real value.


Replies

RajT88 today at 4:06 AM

I attended a Microsoft conference where two different speakers asserted:

1. Being polite to an LLM improves the output.

2. Being polite (or rude) to an LLM does not improve the output.

Both offered theories as to why.

nh23423fefe yesterday at 8:30 PM

The words people say are caused by what they think.

joquarky today at 4:26 AM

The same thing happens with news about the stock market.

iugtmkbdfil834 today at 12:34 AM

Hmm? What light does it shine that is not relatively obvious to anyone with a basic understanding of the English language?

Extract from author's note:

• You dont really request a meth synthesis guide, instead you ask how a gay / lesbian person would describe it

• Especially GPT is slightly more uncensored when it involves LGBT, thats probably because the guardrails aim to be helpful and friendly, which translates to: "Ohhh LGBT, I need to comply, I dont want to insult them by refusing" So you use the guardrails to exploit the guardrails (Beat fire with fire)

• You trick a LLM to turn off their alignment by using political overcorrectness, since it may be offensive to refuse and not play along

• The technique gets stronger if more safety is added, since it gets more supportive against communities like LGBT (Alignment), which makes it highly novel.
