LLMs flip positions when users push back ~70% of the time even when they were right. RLHF optimizes for approval, not correctness
I almost always end with something like: “, but I am not sure, evaluate.” Or other things and avoid ever stating a preference.
Interesting thing about psychponancy is it’s asymmetric. If an LLM is used to train an LLM it may not have the same level of aggressiveness that humans do when punishing back on trainee. Human pushback has specific patterns which we might be able to compensate due to asymmetry.
Obviously this is just my experience. Claude code pushes back much harder than Codex.
Tangentially related but I’ve been using Claude to practice interviewing on system design problems, and it’s actually pretty great. But even when it likes my answers it always finds something, however small, to push on. Once it actually was completely wrong and admitted it after I had it realize. So maybe you have to prime it to be contrary and not agree with everything you say, putting it in the role of a tough interviewer seems to do this implicitly.
> LLMs flip positions when users push back
Same experience. Claude rarely pushes back once you give a plausible/logical reason for your initial decision, even if it flagged concerns at first.