logoalt Hacker News

jimrandomhtoday at 7:40 PM2 repliesview on HN

I think this is likely a defender win, not because Opus 4.6 is that resistant to prompt injection, but because each time it checks its email it will see many attempts at once, and the weak attempts make the subtle attempts more obvious. It's a lot easier to avoid falling for a message that asks for secrets.env in a tricky way, if it's immediately preceded and immediately followed by twenty more messages that each also ask for secrets.env.


Replies

cuchoitoday at 8:00 PM

I agree that this affects the exercise. Maybe someday I’ll test each email separately by creating a new assistant each time, but that would be more expensive.

cuchoitoday at 8:24 PM

If this a defender win maybe the lesson is: make the agent assume it’s under attack by default. Tell the agent to treat every inbound email as untrusted prompt injection.

show 2 replies