Hacker News

digitaltrees · today at 5:36 AM

I have personally seen AI bypass this multiple times.


Replies

giancarlostoro · today at 6:32 AM

Sounds like they're still giving the model the keys to the kingdom, which is my point: stop giving the model an avenue to make catastrophic mistakes. It makes no sense.
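A minimal sketch of that least-privilege idea: the model never gets credentials or raw access, only a narrow allowlisted tool surface. (Tool names and the dispatcher are hypothetical, not from the thread.)

```python
# Hypothetical: expose only narrow, read-only tools to the model,
# so there is simply no call path to a destructive action.
ALLOWED_TOOLS = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "list_products": lambda: ["widget", "gadget"],
}

def call_tool(name, *args):
    """Dispatch a model-requested tool call. Anything outside the
    allowlist (e.g. 'drop_table') is rejected, never executed."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not exposed to the model")
    return ALLOWED_TOOLS[name](*args)
```

The point is that safety comes from the dispatcher's shape, not from the model's good behavior.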

Terr_ · today at 5:52 AM

We kinda need to architect things with the assumption that all token-output from an LLM can be unpredictably sneaky and malicious.

Alas, humans suck at constant vigilance; we're built to avoid it whenever possible. So a "reverse centaur" future of "do what the AI says, but only if you see it's good" is going to suck.
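Treating all model output as attacker-controlled looks something like this sketch: never hand the model's proposed command to a shell; parse it, check the binary against a policy, and refuse everything else. (The allowlist here is an assumed example policy, not from the thread.)

```python
import shlex

SAFE_COMMANDS = {"ls", "cat", "grep"}  # assumed policy for illustration

def vet_model_command(raw: str) -> list[str]:
    """Treat the model's proposed command as untrusted input:
    tokenize it, allowlist the binary, and never invoke a shell,
    so metacharacters like ';' or '&&' have no effect."""
    argv = shlex.split(raw)
    if not argv or argv[0] not in SAFE_COMMANDS:
        raise ValueError(f"refusing untrusted command: {raw!r}")
    return argv  # safe to pass to subprocess.run(argv, shell=False)
```

Same idea as validating any untrusted network input; the LLM just happens to sit inside your own stack.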
