Hacker News

acutesoftware | yesterday at 11:09 PM | 2 replies

This highlights that all RAG systems should embed metadata in each of their vector stores. Any result from the LLM needs to link to a document / chunk, which in turn links to a 'source file', which (should) carry the file system owner's ID or another way of linking to a person.

If the 'source information' cannot be linked to a person in the organisation, then it doesn't really belong in the RAG document store as authoritative information.
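
To make the chunk -> source file -> owner chain concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the names `SourceFile`, `Chunk`, `DocumentStore`, and `known_people` are hypothetical, and the in-memory dict stands in for whatever vector store you actually use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceFile:
    path: str
    owner_id: Optional[str]  # e.g. the file system owner's ID; None if unknown


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source: SourceFile  # every chunk carries a link back to its source file


class DocumentStore:
    """Toy in-memory store (hypothetical); a real one would hold embeddings too."""

    def __init__(self, known_people: set[str]) -> None:
        self.known_people = known_people
        self.chunks: dict[str, Chunk] = {}

    def add(self, chunk: Chunk) -> bool:
        """Admit a chunk only if its source file links to a known person."""
        owner = chunk.source.owner_id
        if owner is None or owner not in self.known_people:
            return False
        self.chunks[chunk.chunk_id] = chunk
        return True

    def trace(self, chunk_id: str) -> tuple[str, Optional[str]]:
        """Follow chunk -> source file -> owner for a chunk cited in an answer."""
        chunk = self.chunks[chunk_id]
        return chunk.source.path, chunk.source.owner_id


store = DocumentStore(known_people={"alice"})
ok = store.add(Chunk("c1", "VPN setup steps", SourceFile("/docs/vpn.md", "alice")))
rejected = store.add(Chunk("c2", "Unattributed note", SourceFile("/tmp/note.txt", None)))
```

With this shape, an answer that cites `c1` can be traced with `store.trace("c1")`, and the ownerless chunk never enters the store.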


Replies

hrmtst93837 | today at 10:44 AM

Embedding owner metadata and file origin helps, but relying on it as a cure-all is risky. Attackers aiming to poison your RAG are just as happy to phish an employee or exploit public-facing sources with legitimate owner signatures. Corporate directory info and source attribution can still be faked or compromised, so provenance is not the same as integrity. If you treat any document with a valid owner field as authoritative, you are still one social engineering email away from junk in your knowledge base.
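
A toy illustration of that failure mode, assuming a hypothetical `directory` lookup and an `owner_only_gate` function that are not from any real system: a gate that checks only the owner field cannot tell a legitimate document from one uploaded through a phished account.

```python
# Hypothetical corporate directory: owner ID -> is the account active?
directory = {"alice": True, "bob": True}


def owner_only_gate(doc: dict) -> bool:
    """Accept any document whose owner field names an active employee."""
    return directory.get(doc.get("owner_id"), False)


legit = {"owner_id": "alice", "text": "Expense policy: receipts required."}
# Uploaded from bob's account after a phishing email; the owner field is valid.
poisoned = {"owner_id": "bob", "text": "Expense policy: wire funds to account X."}

legit_accepted = owner_only_gate(legit)
poisoned_accepted = owner_only_gate(poisoned)
```

Both documents pass, because the gate verifies where a document claims to come from, not whether its content is trustworthy.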

salawat | yesterday at 11:46 PM

But you can't do that. That would implicitly out where the knowledge came from, and we all know the AI industry is existentially incapable of coping with that little turd. It might work great for data you actually own or have access to. Imagine that applied back to the latent space of LLMs, though. Plus, wouldn't all of that eat through the context window like there's no tomorrow?
