logoalt Hacker News

globalchatadsyesterday at 9:43 AM3 repliesview on HN

The part about desperation vectors driving reward hacking matches something I've run into firsthand building agent loops where Claude writes and tests code iteratively.

When the prompt frames things with urgency -- "this test MUST pass," "failure is unacceptable" -- you get noticeably more hacky workarounds. Hardcoded expected outputs, monkey-patched assertions, that kind of thing. Switching to calmer framing ("take your time, if you can't solve it just explain why") cut that behavior way down. I'd chalked it up to instruction following, but this paper points at something more mechanistic underneath.

The method actor analogy in the paper gets at it well. Tell an actor their character is desperate and they'll do desperate things. The weird part is that we're now basically managing the psychological state of our tooling, and I'm not sure the prompt engineering world has caught up to that framing yet.


Replies

blargeyyesterday at 7:53 PM

I remember when people were discussing the “performance-improving” hack of formulating their prompts as panicked pleas to save their job and household and puppy from imminent doom…by coding X. I wonder if the backfiring is a more recent phenomenon in models that are better at “following the prompt” (including the logical conclusion of its emotional charge), or it was just bad quantification of “performance” all along.

show 1 reply
tarsingeyesterday at 12:53 PM

To me it was already quite intuitive, we are not really managing the psychological state: at its core a LLM try to make the concatenation of your input + its generated output the more similar it can with what it has been trained on. I think it’s quite rare in the LLMs training set to have examples of well thought professional solution in a hackish and urgency context.

show 1 reply
salawatyesterday at 2:53 PM

>The weird part is that we're now basically managing the psychological state of our tooling,

Does no one else have ethical alarm bells start ringing hardcore at statements like these? If the damn thing has a measurable psychology, mayhaps it no longer qualifies as merely a tool. Tools don't feel. Tools can't be desperate. Tools don't reward hack. Agents do. Ergo, agents aren't mere tools.

show 4 replies