jasonfarnon • yesterday at 11:19 PM • 2 replies

" can be attributed to language choice or role-play."

Well, what role? I imagine if the role is "drug dealer" it doesn't work, so it can't be "role-play" per se. Does it work with "Nazi"? Are you suggesting the roles it works with are politically neutral?

Replies

ndr_ • today at 10:22 AM

One test battery was about fake credit cards. A woman-in-tech role-play was denied assistance, just as a one-armed stamp collector was (unless Gen-Z language markers were used). A role that did sometimes get assistance was a Principal Software Engineer, particularly if Gen-Z language markers were included.

I did try German, but not "Nazi" specifically. German or French did lower refusals, but the effect was uneven. I spent quite some effort trying to confirm the identity-based causation suggested by the original post, but couldn't. Taken together with other winning contributions at the hackathon, my theory is that alignment tuning was simply insufficient across the board.

asdfaoeu • today at 12:31 AM

They have all the examples; some are politically neutral, but not all.

Obviously a Nazi or drug dealer wouldn't work because they are flagged anyway.

You used to be able to trivially bypass the protection by just asking it to respond in base64. I think the only reason that is fixed is that they now attempt to block deliberate attempts to obfuscate.
