The real thing I think people are rediscovering with file system based search is that there’s a type of semantic search that’s not embedding based retrieval. One that looks more like how a librarian organizes files into shelves based on the domain.
We’re rediscovering forms of search we’ve known about for decades. And it turns out they’re more interpretable to agents.
https://softwaredoug.com/blog/2026/01/08/semantic-search-wit...
Someone simply assumed at some point that RAG must be based on vector search, and everyone followed.
I spent a while working on a retrieval system for LLMs and ended up reinventing a concordance (which is like an index).
It's basically the same thing as Google's inverted index, which is how Google search works.
Nothing new under the sun :)
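For anyone curious, the core of a concordance/inverted index is a few lines of Python. This is a toy sketch (real systems add tokenization, stemming, term positions, and so on; the sample docs are made up):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Illustrative documents, not from any real corpus.
docs = {
    "a.md": "semantic search with embeddings",
    "b.md": "inverted index search",
}
index = build_index(docs)
# index["search"] now maps to both documents.
```

Lookups are then O(1) per term, which is exactly why this structure has survived since long before Google.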
My intuition is that since AI assistants are fictional characters in a story being autocompleted by an LLM, mechanisms that are interpretable as human interactions with language and appear in the pretraining data have a surprising advantage over mechanisms that are more like speculation about how the brain works or abstract concepts.
Exactly. Traditional library science truly captured deep patterns of information architecture.
https://x.com/wibomd/status/1818305066303910006
Disney got this right in Ralph Breaks the Internet.
Similar effort with PageIndex [1], which basically creates a table-of-contents-like tree. An LLM then traverses the tree to figure out which chunks are relevant to the context in the prompt.
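The idea fits in a short sketch. This is not PageIndex's actual code; the LLM call is stubbed out with naive keyword overlap, and the ToC contents are invented:

```python
# Hypothetical ToC tree: inner nodes are sections, leaves hold chunks.
toc = {
    "title": "Manual",
    "children": [
        {"title": "Installation", "children": [], "chunk": "pip install ..."},
        {"title": "Configuration", "children": [
            {"title": "Environment variables", "children": [],
             "chunk": "Set API_KEY ..."},
        ]},
    ],
}

def pick_child(node, query):
    # Stand-in for an LLM call: pick the child whose title shares
    # the most words with the query.
    def overlap(child):
        return len(set(child["title"].lower().split())
                   & set(query.lower().split()))
    return max(node["children"], key=overlap)

def traverse(node, query):
    # Descend the tree until we hit a leaf, then return its chunk.
    while node.get("children"):
        node = pick_child(node, query)
    return node["chunk"]
```

The interesting property is that every retrieval step is a readable decision ("I went into Configuration because..."), which a cosine-similarity score never gives you.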
This kind of circles back to ontological NLP, which used knowledge representation as a primitive for language processing. There is _a ton_ of work in that direction.
Aren’t most successful RAGs using a combination of embedding similarity + BM25 + reranking? I thought there were very few RAGs that only did pure embedding similarity, but I may be mistaken.
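Right — and the blending step is often just reciprocal rank fusion over the per-retriever ranked lists. A minimal sketch (k=60 is the conventional constant; the doc ids and hit lists are made up):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists (e.g. BM25 hits and embedding hits)
    into one list, scoring each doc by the sum of 1/(k + rank + 1)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d3", "d2"]      # lexical retriever's ranking
vector_hits = ["d2", "d1", "d4"]    # embedding retriever's ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Docs that appear high in both lists float to the top; a cross-encoder reranker is then typically applied to the fused head.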
> Our documentation was already indexed, chunked, and stored in a Chroma database to power our search, so we built ChromaFs
It's obvious from that sentence that these guys neither understand RAG nor realized that the solution to their agentic problem didn't need any of these further abstractions, including vectors or grep
I've got to say, people also seem to be missing really simple tricks with RAG that help. Using longer chunks and appending the file path to each chunk makes a big difference.
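Something like this (chunk sizes and the header format are illustrative, not tuned):

```python
def make_chunks(path, text, chunk_size=2000, overlap=200):
    """Split a file into overlapping chunks, prepending the file path
    so the retriever and the LLM both see where each chunk came from."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        body = text[start:start + chunk_size]
        chunks.append(f"# source: {path}\n{body}")
    return chunks
```

The path line costs a few tokens per chunk and gives the model the same signal a human gets from seeing `docs/billing/refunds.md` in a search result.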
Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.
I think it's cool that LLMs can effectively do this kind of categorization on the fly at relatively large scale. When you give the LLM tools beyond just "search", it really is effectively cheating.
Yep, I was using RAG for all sorts of stuff and now moved everything to just rg+fd+cd+ls, much faster, easier, etc.
And next, we’ll get to tag based file systems
Inverted indexes have the major advantage of supporting Boolean operators.
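With posting lists stored as sets, Boolean queries are just set algebra — a toy sketch (the index contents are invented):

```python
# Hypothetical inverted index: term -> set of doc ids (posting list).
index = {
    "search": {"a.md", "b.md", "c.md"},
    "vector": {"a.md"},
    "index":  {"b.md", "c.md"},
}

def and_q(*terms):
    """Docs containing every term: intersect the posting lists."""
    return set.intersection(*(index[t] for t in terms))

def or_q(*terms):
    """Docs containing any term: union the posting lists."""
    return set.union(*(index[t] for t in terms))

def not_q(term, universe):
    """Docs in the collection that do NOT contain the term."""
    return universe - index[term]
```

Embedding similarity gives you no equivalent of `NOT deprecated AND auth`, which is exactly the kind of query agents issue constantly.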
More and more often you see "new discoveries" that are very old concepts. Usually the only discovery is the author discovering the concept for himself, but nowadays it seems essential to post it as if you'd discovered something new.
Turns out the millions of people in knowledge work aren't librarians, and they wing shit everywhere
Agreed. I've been working on a codebase with 400+ Python files and the difference is stark. With embedding-based RAG, the agent kept pulling irrelevant code snippets that happened to share vocabulary. Switched to just letting the agent browse the directory tree and read files on demand -- it figured out the module structure in about 30 seconds and started asking for the right files by path.
The directory hierarchy is already a human-curated knowledge graph. We just forgot that because we got excited about vector math.