I'm not sure that Embedding Anomaly Detection as he described it is either a good general solution or practical.
I don't think it is practical because it means that for every new chunk you embed into your database you first need to compare it with every chunk you have ever indexed. That's O(n) work per insert, so the larger your repository gets, the slower it becomes to add new data.
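To make the cost concrete, here is a minimal sketch of that per-insert comparison. This is my own toy illustration, not the approach from the parent comment: the embeddings, the `threshold`, and the "flag a chunk when nothing in the index is similar to it" rule are all assumptions, but the scan over every stored embedding is the part that doesn't scale either way.

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_anomalous(new_emb, index, threshold=0.2):
    # The expensive part: one comparison against EVERY stored
    # embedding, so each insert costs O(n) and indexing a corpus
    # of n chunks costs O(n^2) overall.
    if not index:
        return False
    best = max(cosine(new_emb, e) for e in index)
    # Assumed flagging rule (hypothetical): the chunk is an anomaly
    # if nothing already indexed is similar to it.
    return best < threshold

index = [[1.0, 0.0], [0.9, 0.1]]
print(is_anomalous([0.95, 0.05], index))  # close to existing data -> False
print(is_anomalous([0.0, 1.0], index))    # unlike anything indexed -> True
```

In practice you would batch this with a vector index rather than a Python loop, but the asymptotics of "compare against everything" stay the same.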
And in general it doesn't seem like a good approach, because I suspect that in the real world it is pretty common to have significant overlap between documents. Let me give one example: imagine you create a database with all the interviews rms (Richard Stallman) ever gave. In this database you will have a lot of chunks that talk about how "Linux is actually GNU/Linux"[0], but that doesn't mean there is anything wrong with these chunks.
I've been thinking about this problem while writing this response, and I think there is another way to apply the idea you brought up. First, instead of doing this check while you are adding data, you could have a 'self-healing' process that continuously runs against your database looking for bad data. And second, you could automate it with an LLM: the approach would be to send several similar chunks in a prompt like "Given the following chunks, do you see anything that may break the $security_rules? $similar_chunks". With this you can have grounding rules like "corrections of financial results need to be available at $URL"
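A rough sketch of what that prompt assembly could look like. Everything here is hypothetical (the helper name, the chunk labels, the example rule and chunks); the actual LLM call is deliberately left out since any client API would be an assumption:

```python
def build_review_prompt(similar_chunks, grounding_rules):
    # Hypothetical helper: batch a group of similar chunks plus the
    # grounding rules into one review prompt for an LLM, following
    # the template sketched above.
    rules = "\n".join(f"- {r}" for r in grounding_rules)
    chunks = "\n\n".join(
        f"[chunk {i}]\n{c}" for i, c in enumerate(similar_chunks)
    )
    return (
        "Given the following chunks, do you see anything that may "
        f"break these rules?\n\nRules:\n{rules}\n\nChunks:\n{chunks}"
    )

# Example: two overlapping chunks where one corrects the other.
prompt = build_review_prompt(
    ["Q3 revenue was $10M.", "Correction: Q3 revenue was $8M."],
    ["corrections of financial results need to be available at $URL"],
)
print(prompt)
```

The nice property is that the self-healing job can group chunks by similarity offline, so you only pay for LLM calls on clusters, not on every insert.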