ilitirit • yesterday at 9:16 AM • 2 replies

I got downvoted for asking a related question recently, but I also don't think people really understood what I was asking - I'm not trying to anthropomorphise LLMs to that extent.

Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that feedback will it absorb into its persona (as some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. the sycophantic nature of LLMs)? How much can be attributed to training data?

Obviously this will vary by model and training, but I'm trying to get a general understanding.

I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.
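One way to get at the "better or worse results" question empirically is a paired comparison: run the same tasks under a neutral and a hostile feedback line and compare accuracy. The following is only a minimal sketch under stated assumptions. `ask_model`, `compare_tones`, `TONES`, and `fake_model` are hypothetical names, not part of any real API; the feedback is simply prepended to the question rather than sent as a real multi-turn exchange; and the stand-in model ignores tone entirely, so the numbers it produces say nothing about how real LLMs behave.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical feedback lines; the wording is illustrative only.
TONES: Dict[str, str] = {
    "neutral": "That answer is wrong. Please try again.",
    "hostile": "You're an absolute moron, of course that's wrong!",
}


def compare_tones(
    ask_model: Callable[[str], str],
    tasks: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return the fraction of tasks answered correctly under each tone."""
    scores: Dict[str, float] = {}
    for tone, feedback in TONES.items():
        correct = 0
        for question, expected in tasks:
            # Assumption: feedback is prepended to the question as plain text.
            prompt = f"{feedback}\n\n{question}"
            answer = ask_model(prompt)
            # Crude scoring: case-insensitive substring match.
            if expected.strip().lower() in answer.strip().lower():
                correct += 1
        scores[tone] = correct / len(tasks) if tasks else 0.0
    return scores


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline.

    It is deliberately tone-blind and is not a claim about real models.
    """
    if "2 + 2" in prompt:
        return "4"
    return "I don't know"


if __name__ == "__main__":
    sample_tasks = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
    print(compare_tones(fake_model, sample_tasks))
```

To try it on an actual model, `fake_model` would be swapped for a function that sends the prompt to that model and returns its text. Since this will vary by model and training, a meaningful comparison would need many tasks and repeated runs per tone.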

Replies

fennecfoxy • yesterday at 9:19 AM

Probably quite a lot, if you look at what Anthropic found around persona vectors: https://www.anthropic.com/research/persona-vectors

I imagine the context will always sway the model to some degree: not only how it handles the task you're trying to get it to do (i.e. the instructions), but also its persona, how accurate it is, and the way it acts.

Foobar8568 • today at 5:22 AM

Based on my own experience with vibe coding difficult stuff outside of my expertise, I definitely got better outcomes with "Fuck you, shut up and do it, ffs, you are a moron."
