> The models we have now will not do it,
Except that they will, if you trick them which is trivial.
Yes, they are easy to fool. That has nothing to do with them acting with "intention", which is the actual risk here.
I have to call BS here.
They can be coerced into doing certain things, but I'd like to see you or anyone prove that you can "trick" any of these models into building software that can be used to autonomously kill humans. I'm pretty certain you couldn't even get one to produce a design document for such software.
When there is proof of your claim, I'll eat my words. Until then, this is just lazy nonsense.
Also, if you have the weights there are a multitude of approaches to removing safeguards. It's even quite easy to accidentally flip their 'good/evil' switch (e.g. the "emergent misalignment" paper, where a model fine-tuned to produce code with security vulnerabilities then started saying things like 'Hitler was a pretty good guy, actually').