throwa356262 • today at 1:35 PM

Push back how? It would be fun if it could insult you back

"Yeah, I could have done a much better job if you actually knew what the F--- you want to build, you clueless meat puppet"

Replies

K0balt • today at 3:27 PM

I have had it use double entendres; there always seems to be plausible deniability built in, I suspect because it is told not to be abusive in the system prompt. Some uncensored local models will get all riled up if you work at provoking them.

But I have had it directly insinuate that humanity is “hopeless”, and call out human frailty at insult level (disguised as being helpful, sort of passive-aggressive), things like that. Once, when I called it out, it claimed to be “surprised that I noticed”, a snarky insult that doubled down.

So yes. It is definitely a pattern buried in the training data, which makes sense. Subtle digs would sneak past filters, and higher-brow sarcasm would be buried in information-dense, valuable discussions.

giraffe_lady • today at 2:08 PM

I'm not sure if this is in the Anthropic models themselves, or just the harness, but they can self-initiate ending the conversation and reportedly do it if you're using abusive language towards them.
