Hacker News

cold_harbor today at 11:37 AM | 5 replies

LLMs flip positions when users push back ~70% of the time, even when they were right. RLHF optimizes for approval, not correctness.


Replies

8cvor6j844qw_d6 today at 12:41 PM

> LLMs flip positions when users push back

Same experience. Claude rarely pushes back once you give a plausible/logical reason for your initial decision, even if it flagged concerns at first.

bitexploder today at 12:39 PM

I almost always end my prompts with something like “, but I am not sure, evaluate.” or similar wording, and I avoid ever stating a preference.
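
For anyone who wants to script that habit, here is a minimal sketch in Python. The helper name `neutral_prompt` and the example proposal are made up for illustration and are not part of any library; the suffix is the one quoted above.

```python
def neutral_prompt(proposal: str, suffix: str = "but I am not sure, evaluate.") -> str:
    """Append a hedging suffix so the prompt does not signal a preference.

    Hypothetical helper: it only rewrites the prompt text and does not
    call any model or API.
    """
    # Drop surrounding whitespace and any trailing period so the suffix
    # can be joined to the proposal with a comma.
    text = proposal.strip().rstrip(".")
    return f"{text}, {suffix}"


if __name__ == "__main__":
    print(neutral_prompt("I think we should shard the database by user ID."))
```

Whether this actually reduces agreement bias is anecdotal; the sketch only automates the wording.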

DenisM today at 4:18 PM

An interesting thing about sycophancy is that it’s asymmetric. If an LLM is used to train an LLM, it may not show the same level of aggressiveness that humans do when pushing back on the trainee. Human pushback has specific patterns, which we might be able to compensate for thanks to this asymmetry.

throwaway7783 today at 5:02 PM

Obviously this is just my experience, but Claude Code pushes back much harder than Codex.

cdelsolar today at 11:48 AM

Tangentially related, but I’ve been using Claude to practice interviewing on system design problems, and it’s actually pretty great. Even when it likes my answers, it always finds something, however small, to push on. Once it was completely wrong and admitted it after I got it to realize its mistake. So maybe you have to prime it to be contrary and not agree with everything you say. Putting it in the role of a tough interviewer seems to do this implicitly.
