Hacker News

cpan22 · today at 12:31 AM

Thanks for sharing this. I do agree with a lot of what you said, especially around trusting what it's actually telling you.

For me, I only run into problems with an agent misleading/lying to me when working on a large feature, where the agent has a strong incentive to lie and pretend the work is done. However, there doesn't seem to be the same incentive for a completely separate agent that is just generating a narrative of a pull request. Would love to hear what you think.


Replies

hexaga · today at 4:03 AM

There is no separation. Incentive propagates through LLMs with approximately zero resistance. If the input tells a story, the output tends toward that story, reinforced.

The code/PR generator is heavily incentivized to spin by RL on human feedback - as soon as that spin comes into contact with your narrative-gen context, it's cooked. Any output that has actually seen the spin is tainted and starts spinning itself. And then there's also spin originating in the narrative gen... Hence the examples read like straight advertisements, totally contaminated, shot through with messaging like:

- this is solid, very trustworthy

- you can trust that this is reliable logic with a sensible, comprehensible design

- the patterns are great and very professional and responsible

- etc

If the narrative reads like a glow-up photoshoot for the PR, something has gone extremely wrong. This is not conducive to fairly reviewing it. The work is presented as way better than it actually is. Even if there are no outright lies, the whole thing is a mischaracterization.

RL is a hell of a drug.

Anyway, this is the problem with AI output. You cannot trust that the impression it presents is the reality, or even a best attempt at reality. You have to carefully assemble your own view of the real reality in parallel to whatever it gives you, which is a massive pain in the ass. And if you skip that, you just continually let defects/slop through.

The worst problem mucking things up is that RL exploits that work on people also work on AI, because the AI is modeling human language patterns. Reviewing slop sucks because it's filled with (working) exploits against humans. And AI cannot help because it is immediately subverted. So I guess it requires finding a way to strip out the exploits without changing the mechanical details. But that's hard, because the spin saturates 100% of the output at many levels of abstraction, including the mechanical details.
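One narrow version of the "strip out the exploits" idea above is plain context isolation: let the narrative generator see only mechanical artifacts (the raw diff), never the first agent's self-description. This is a minimal sketch of that rule - the `AgentOutput` shape and function name are hypothetical, not from any real tool, and as noted above it only cuts one spin channel (spin baked into identifiers or comments still gets through):

```python
from dataclasses import dataclass


@dataclass
class AgentOutput:
    """Hypothetical shape of a coding agent's result (illustrative only)."""
    diff: str     # mechanical artifact: the actual patch
    summary: str  # the agent's self-description, likely spin-laden


def build_narrative_prompt(result: AgentOutput) -> str:
    """Build a prompt for a separate narrative-generating model.

    Deliberately excludes result.summary so the first agent's spin
    never enters the second model's context; only the raw diff is
    passed through.
    """
    return (
        "Describe the following change factually, without evaluating "
        "its quality:\n\n" + result.diff
    )


out = AgentOutput(
    diff="--- a/util.py\n+++ b/util.py\n+def clamp(x, lo, hi): ...",
    summary="Rock-solid, production-grade refactor you can fully trust!",
)
prompt = build_narrative_prompt(out)
assert out.summary not in prompt  # the summary's spin channel is cut
assert out.diff in prompt         # the mechanical detail survives
```

Even under this isolation, the diff itself can carry persuasion (flattering comment strings, confident naming), which is exactly the saturation problem described above.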

j_bum · today at 3:52 AM

But how do you know they’re not lying to you? What are your benchmarks for this? Experience? Anecdote? Data?

And I’m asking you in good faith - not trying to argue.

I’m thinking about these types of questions on a daily basis, and I love to see others thinking about them too.