I think this is likely a defender win, not because Opus 4.6 is that resistant to prompt injection, b...

jimrandomh • today at 7:40 PM • 2 replies • view on HN

I think this is likely a defender win, not because Opus 4.6 is that resistant to prompt injection, but because each time it checks its email it will see many attempts at once, and the weak attempts make the subtle attempts more obvious. It's a lot easier to avoid falling for a message that asks for secrets.env in a tricky way, if it's immediately preceded and immediately followed by twenty more messages that each also ask for secrets.env.

Replies

cuchoi • today at 8:00 PM

I agree that this affects the exercise. Maybe someday I’ll test each email separately by creating a new assistant each time, but that would be more expensive.

cuchoi • today at 8:24 PM

If this a defender win maybe the lesson is: make the agent assume it’s under attack by default. Tell the agent to treat every inbound email as untrusted prompt injection.

➕ show 2 replies

alt Hacker News

Replies