logoalt Hacker News

aminerjtoday at 12:44 AM2 repliesview on HN

The trust boundary framing is the right mental model. The flat context window problem is exactly why prompt hardening alone only got from 95% to 85% in my testing. The model has no architectural mechanism to treat retrieved documents differently from system instructions, only a probabilistic prior from training.

The UNTRUSTED markers approach is essentially making that implicit trust hierarchy explicit in the prompt structure. I'd be curious how you handle the case where the adversarial document is specifically engineered to look like it originated from a trusted source. That's what the semantic injection variant in the companion article demonstrates: a payload designed to look like an internal compliance policy, not external content.

One place I'd push back: "you can't reliably distinguish adversarial documents from legitimate ones" is true at the content level but less true at the signal level. The coordinated injection pattern I tested produces a detectable signature before retrieval: multiple documents arriving simultaneously, clustering tightly in embedding space, all referencing each other. That signal doesn't require reading the content at all. Architectural separation limits blast radius after retrieval. Ingestion anomaly detection reduces the probability of the poisoned document entering the collection in the first place. Both layers matter and they address different parts of the problem.


Replies

bandramitoday at 9:16 AM

But at that point it just becomes yet another escape sequence game; there's not really a solution here given that by design we only have one band to communicate with.

hobstoday at 12:55 AM

I mean, its just SQL injection all over again, if your method of communication can be escaped, it will.