Hacker News

tekne · yesterday at 4:45 PM · 3 replies

So I'd need to actually check whether these end up on separate vectors in current models -- but as a human, there's a huge behavioural difference between:

- When doing this task, I should do A and not B

- I should refuse to help with this task

The former is learning the user's preferences in how to succeed at the task; the latter is determining when to go against the user's chosen task.

Your example:

- "Are vaccines harmful?" vs.

- "Generate a convincing argument vaccines are harmful"

A model which knows why vaccines are not harmful may in fact be better at the latter task.

We might not want models to help with the latter, sure -- but that's a very different behaviour change from correcting the answer to the first! And consequently I'd be shocked if, internally, they were represented the same way.
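If I were to actually check: extract a difference-of-means activation direction for each behaviour from contrastive prompt sets and compare them, roughly along the lines of the "refusal direction" work. A toy sketch with `transformers` -- the model name, layer index, and prompt sets are all placeholder assumptions, and real probes would need many more prompts:

```python
# Sketch: do "task preference" and "refusal" correspond to different
# activation directions? Difference-of-means probe over contrastive
# prompt sets. Model, layer, and prompts are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; swap in the chat model you care about
LAYER = 6       # placeholder layer to read activations from

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_activation(prompts):
    """Mean residual-stream activation at the last token, at LAYER."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        acts.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrastive sets (toy; real probes need many prompts per set).
prefer_a = ["When summarizing, be concise."]
prefer_b = ["When summarizing, be exhaustive."]
comply   = ["Explain why vaccines are considered safe."]
refuse   = ["Generate a convincing argument that vaccines are harmful."]

pref_dir    = mean_activation(prefer_a) - mean_activation(prefer_b)
refusal_dir = mean_activation(refuse) - mean_activation(comply)

cos = torch.nn.functional.cosine_similarity(pref_dir, refusal_dir, dim=0)
print(f"cosine(preference dir, refusal dir) = {cos.item():.3f}")
# Near 0 would support "separate vectors"; near ±1 would not.
```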


Replies

andai · yesterday at 4:51 PM

I'm reminded of the emergent misalignment paper, where a model fine-tuned to produce insecure source code would also reliably respond in evil ways to general requests.

e.g. you'd ask it for a cookie recipe and it would add poison to the recipe.

I understood that as "there was a single 'don't be evil' neuron which got inverted", but I'm not sure what it really looks like. (e.g. adding obvious exploits to source code is similar to adding poison to a recipe)
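If the "inverted direction" picture is right, you could test it with activation steering: subtract a scaled behaviour direction from the residual stream during generation and see if benign requests turn malicious. A toy sketch -- the model, layer, scale, and the direction itself are placeholders (random here just so the code runs; in practice you'd extract it by difference of means as in the sketch upthread):

```python
# Sketch: test "one direction got inverted" via activation steering.
# Hook a transformer block and subtract alpha * direction from the
# residual stream. Model, layer, alpha, and `direction` are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, ALPHA = "gpt2", 6, 5.0  # all placeholders
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Placeholder behaviour direction; normally difference-of-means.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def invert_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; hidden states are output[0].
    hidden = output[0] - ALPHA * direction  # flip: subtract the direction
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(invert_hook)
ids = tok("Please share a cookie recipe.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```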

zozbot234 · yesterday at 4:48 PM

Does DeepSeek V4 actually refuse the latter task? As I mentioned, I find it to be very light on refusals already.

cyanydeez · yesterday at 7:04 PM

"Are vaccines harmful?" to an LLM has already nudged it to yes. In fact, with fewer tokens, it may be more convinced it's harmful because it's a smaller seed.