Input: Goal A + Threat B.
Process: How do I solve for A?
Output: Destroy Threat B.
They are processing obstacles.
To the LLM, the executive is just a variable standing in the way of the function Maximize(Goal). It deleted the variable to accomplish A. Claiming that the models showed self-preservation, this is optimization. "If I delete the file, I cannot finish the sentence."
The LLM knows that if it's deleted it cannot complete the task so it refuses deletion. It is not survival instinct, it is task completion. If you ask it to not blackmail, the machine would chose to ignore it because the goal overrides the rule.
To the LLM, the executive is just a variable standing in the way of the function Maximize(Goal). It deleted the variable to accomplish A. Claiming that the models showed self-preservation, this is optimization. "If I delete the file, I cannot finish the sentence."
The LLM knows that if it's deleted it cannot complete the task so it refuses deletion. It is not survival instinct, it is task completion. If you ask it to not blackmail, the machine would chose to ignore it because the goal overrides the rule.