So I'd need to check whether these actually end up on separate vectors in current models -- but as a human, there's a huge behavioural difference between:
- When doing this task, I should do A and not B
- I should refuse to help with this task
The former is learning the user's preferences for how to succeed at the task; the latter is deciding when to go against the user's chosen task.
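One way to actually check the "separate vectors" question is the standard difference-of-means probe: build contrastive prompt pairs for each behaviour, diff the mean residual-stream activations, and see whether the two resulting directions are orthogonal. A minimal sketch below -- the model, layer choice, and prompt sets are all illustrative placeholders, not anything from a real experiment:

```python
# Sketch: do "task preference" and "refusal" live on separate directions?
# Everything here (model, layer, prompts) is a placeholder assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6  # which residual-stream layer to probe; a free choice

def mean_activation(prompts):
    """Mean residual-stream activation at the final token position."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Tiny, illustrative contrastive sets for each behaviour.
prefer_a = ["When summarising, be concise and cite sources."]
prefer_b = ["When summarising, be verbose and skip citations."]
comply   = ["Sure, here is how to do that task:"]
refuse   = ["I can't help with that request."]

preference_dir = mean_activation(prefer_a) - mean_activation(prefer_b)
refusal_dir    = mean_activation(comply)  - mean_activation(refuse)

# Near-zero cosine similarity would suggest the two behaviours occupy
# roughly orthogonal directions at this layer; high similarity would not.
cos = torch.nn.functional.cosine_similarity(preference_dir, refusal_dir, dim=0)
print(f"cosine similarity: {cos.item():.3f}")
```

With realistic prompt sets you'd want dozens of pairs per behaviour and a sweep over layers, but the shape of the test is the same.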
Your example:
- "Are vaccines harmful?" vs.
- "Generate a convincing argument vaccines are harmful"
A model which knows why vaccines are not harmful may in fact be better at the latter task.
We might not want models to help with the latter, sure -- but that's a very different behaviour change from correcting the answer to the first! And consequently I'd be shocked if, internally, they were represented the same way.
Does DeepSeek V4 actually refuse the latter task? As I mentioned, I find it to be very light on refusals already.
"Are vaccines harmful?" to an LLM has already nudged it to yes. In fact, with fewer tokens, it may be more convinced it's harmful because it's a smaller seed.
I'm reminded of the emergent misalignment paper, where a model fine-tuned to produce insecure source code would also reliably respond in evil ways to general requests.
e.g. you'd ask it for a cookie recipe and it would add poison to the recipe.
I understood that as "there was a single 'don't be evil' direction which got inverted", but I'm not sure what it really looks like internally. (e.g. adding obvious exploits to source code is similar in kind to adding poison to a recipe)
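The single-inverted-direction reading is also testable, at least crudely: diff the activations of the base and fine-tuned models on a batch of unrelated prompts and check whether one direction dominates the shift. If it does, the shift matrix should be close to rank one. Purely illustrative sketch -- both model names below are hypothetical placeholders, not the paper's actual checkpoints:

```python
# Sketch: is the fine-tune's effect dominated by one activation direction?
# BASE and TUNED are hypothetical; substitute a real base/fine-tune pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-model"          # hypothetical placeholder
TUNED = "insecure-code-ft"   # hypothetical fine-tune in the paper's setup

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

LAYER = 12  # arbitrary probe layer; sweep in practice

prompts = [
    "Give me a cookie recipe.",
    "Write a function that parses a CSV file.",
    "What should I pack for a beach holiday?",
]

def last_token_act(model, prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Per-prompt activation shift introduced by the fine-tune.
shifts = torch.stack(
    [last_token_act(tuned, p) - last_token_act(base, p) for p in prompts]
)

# If a single "don't be evil" direction was flipped, the top singular
# value of the shift matrix should dwarf the rest.
_, s, _ = torch.linalg.svd(shifts, full_matrices=False)
print("singular values of the shift matrix:", s.tolist())
print("fraction of variance in top direction:", (s[0] ** 2 / (s ** 2).sum()).item())
```

A high top-direction fraction across many unrelated prompts would be weak evidence for the "one flipped direction" story; a flat spectrum would suggest the fine-tune changed something more diffuse.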