> The models we have now will not do it,
Except that they will, if you trick them which is trivial.
Yes, they are easy to fool. That has nothing to do with them acting with "intention", which is the actual risk here.
I have to call BS here.
They can be coerced into doing certain things, but I'd like to see you or anyone prove that you can "trick" any of these models into building software that can be used to autonomously kill humans. I'm pretty certain you couldn't even get one to produce a design document for such software.
When there is proof of your claim, I'll eat my words. Until then, this is just lazy nonsense.
Also, if you have the weights there are a multitude of approaches to removing safeguards. It's even quite easy to accidentally flip their 'good/evil' switch (e.g. the "emergent misalignment" paper, where a model fine-tuned to produce code with security vulnerabilities then started saying things like 'Hitler was a pretty good guy, actually').