LLMs flip positions when users push back ~70% of the time even when they were right. RLHF optimizes ...

cold_harbor • today at 11:37 AM • 5 replies • view on HN

LLMs flip positions when users push back ~70% of the time even when they were right. RLHF optimizes for approval, not correctness

Replies

8cvor6j844qw_d6 • today at 12:41 PM

> LLMs flip positions when users push back

Same experience. Claude rarely pushes back once you give a plausible/logical reason for your initial decision, even if it flagged concerns at first.

➕ show 2 replies

bitexploder • today at 12:39 PM

I almost always end with something like: “, but I am not sure, evaluate.” Or other things and avoid ever stating a preference.

➕ show 1 reply

DenisM • today at 4:18 PM

Interesting thing about psychponancy is it’s asymmetric. If an LLM is used to train an LLM it may not have the same level of aggressiveness that humans do when punishing back on trainee. Human pushback has specific patterns which we might be able to compensate due to asymmetry.

throwaway7783 • today at 5:02 PM

Obviously this is just my experience. Claude code pushes back much harder than Codex.

cdelsolar • today at 11:48 AM

Tangentially related but I’ve been using Claude to practice interviewing on system design problems, and it’s actually pretty great. But even when it likes my answers it always finds something, however small, to push on. Once it actually was completely wrong and admitted it after I had it realize. So maybe you have to prime it to be contrary and not agree with everything you say, putting it in the role of a tough interviewer seems to do this implicitly.

➕ show 1 reply

alt Hacker News

Replies