This is a general thing with agent orchestration. A good sandbox does something for your local environment, but nothing for remote machines/APIs.
I can't say this loudly enough, "an LLM with untrusted input produces untrusted output (especially tool calls)." Tracking sources of untrusted input with LLMs will be much harder than traditional [SQL] injection. Read the logs of something exposed to a malicious user and you're toast.
Information flow control is a solid mindset but operationally complex and doesn’t actually safeguard you from the main problem.
Put an openclaw like thing in your environment, and it’ll paperclip your business-critical database without any malicious intent involved.
Even an LLM with trusted input produces untrusted output.
Given the "random" nature of language models even fully trusted input can produce untrusted output.
"Find emails that are okay to delete, and check with me before deleting them" can easily turn into "okay deleting all your emails", as so many examples posted online are showing.
I have found this myself with coding agents. I can put "don't auto commit any changes" in the readme, in model instructions files, at the start of every prompt, but as soon as the context window gets large enough the directive will be forgotten, and there's a high chance the agent will push the commit without my explicit permission.