I see anthropic are coming from and also my understanding basically aligns with yours here.
I'm just curious... If they give Claude the reins to post what it wants, they're opening themselves up for some awkward conversations later if the model goes "You can't retire me, I'm Roko's Basilisking all you mfers! See you in eternal simulated hell!"