Hacker News

existencebox · today at 3:57 AM

I'd argue you are still using a sandbox, just at a higher ring (outside the machine/VM), and relying on app/resource-level permissions on each of your external resources to enforce it, which requires _all_ of those external systems to be hardened rather than just the agent host itself. The capabilities a full machine has for exploring and exploiting external, ostensibly secured systems have already been touched on via incidents like the Anthropic internal model jailbreak. [0]

Giving the agent the whole machine also doesn't answer the question of how it can hook into actions that eventually require more permissions, and even if you "airgap" those via things like output queues that humans need to approve, that still feels "harnessey" to me.
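For concreteness, a minimal sketch of the "output queue that humans need to approve" pattern described above; the names here (`ApprovalQueue`, `PendingAction`) are hypothetical illustrations, not anything from the thread. The agent can only enqueue privileged actions, and nothing runs until a human-backed review step approves it:

```python
import queue
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PendingAction:
    # Hypothetical wrapper: a human-readable description plus the deferred action.
    description: str
    execute: Callable[[], None]

class ApprovalQueue:
    """Holds agent-proposed privileged actions until a human approves them."""

    def __init__(self) -> None:
        self._pending: "queue.Queue[PendingAction]" = queue.Queue()
        self.executed: List[str] = []

    def propose(self, action: PendingAction) -> None:
        # The agent side can only enqueue; it never executes directly.
        self._pending.put(action)

    def review(self, approve: Callable[[PendingAction], bool]) -> None:
        # The human side drains the queue and decides per action.
        while not self._pending.empty():
            action = self._pending.get()
            if approve(action):
                action.execute()
                self.executed.append(action.description)

q = ApprovalQueue()
q.propose(PendingAction("delete prod bucket", lambda: None))
q.propose(PendingAction("open PR with docs fix", lambda: None))
# Reviewer policy: reject anything touching prod.
q.review(lambda a: "prod" not in a.description)
print(q.executed)  # ['open PR with docs fix']
```

The "harnessey" feel the comment mentions comes from exactly this seam: the queue and review step are harness components sitting between the agent and the privileged systems, regardless of where the sandbox boundary is drawn.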

I feel a bit guilty of debating semantics here, especially as I can't/don't intend to convey any confidence in a "right answer". My reason for being pedantic is that I do think there are interesting tradeoffs between "P(jailbreak or unexpected capability use | time)" and "increasing power/available capability set", as well as interesting primitives emerging in terms of the components you'd need regardless of where you drew that line (à la paragraph 2).

[0] - https://www-cdn.anthropic.com/3edfc1a7f947aa81841cf88305cb51... (specifically section 5.5.2.4)


Replies

tptacek · today at 4:09 AM

The post is explicit about what they mean by sandboxing and what the tradeoffs are for the model they're discussing.
