The real thing I think people are rediscovering with file system based search is that there’s a type...

softwaredoug • last Friday at 5:41 PM • 17 replies

The real thing I think people are rediscovering with filesystem-based search is that there’s a type of semantic search that’s not embedding-based retrieval: one that looks more like how a librarian organizes files onto shelves based on the domain.

We’re rediscovering forms of search we’ve known about for decades. And it turns out they’re more interpretable to agents.

https://softwaredoug.com/blog/2026/01/08/semantic-search-wit...

Replies

neuzhou • last Friday at 10:24 PM

Agreed. I've been working on a codebase with 400+ Python files and the difference is stark. With embedding-based RAG, the agent kept pulling irrelevant code snippets that happened to share vocabulary. Switched to just letting the agent browse the directory tree and read files on demand -- it figured out the module structure in about 30 seconds and started asking for the right files by path.

The directory hierarchy is already a human-curated knowledge graph. We just forgot that because we got excited about vector math.
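
A minimal sketch of what "browse the directory tree and read files on demand" could look like as two agent tools. The function names `list_dir` and `read_file`, the `max_chars` cap, and the root-escape check are assumptions for illustration, not the setup described in the comment above.

```python
from pathlib import Path


def _resolve_inside(root: str, rel: str) -> tuple[Path, Path]:
    """Resolve rel against root and refuse paths that escape the root."""
    base = Path(root).resolve()
    target = (base / rel).resolve()
    if target != base and base not in target.parents:
        raise ValueError(f"{rel!r} is outside the root")
    return base, target


def list_dir(root: str, rel: str = ".") -> list[str]:
    """Return the sorted entries of one directory; subdirectories get a trailing slash."""
    _, target = _resolve_inside(root, rel)
    return sorted(
        entry.name + "/" if entry.is_dir() else entry.name
        for entry in target.iterdir()
    )


def read_file(root: str, rel: str, max_chars: int = 4000) -> str:
    """Return at most max_chars characters of one file, addressed by relative path."""
    _, target = _resolve_inside(root, rel)
    return target.read_text(encoding="utf-8", errors="replace")[:max_chars]
```

The agent would call `list_dir` level by level and `read_file` only on paths that look relevant, so the hierarchy itself does the narrowing.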

wielebny • last Friday at 5:56 PM

Someone simply assumed at some point that RAG must be based on vector search, and everyone followed.

andai • yesterday at 11:43 AM

I spent a while working on a retrieval system for LLMs and ended up reinventing a concordance (which is like an index).

It's basically the same thing as Google's inverted index, which is how Google search works.

Nothing new under the sun :)
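
For readers who haven't met the structure, here is a toy positional inverted index, which is roughly what a concordance records: each term mapped to the documents and word positions where it occurs. The names `tokenize`, `build_index`, and `lookup` are made up for this sketch, and a production search engine does far more than this.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def lookup(index: dict[str, dict[str, list[int]]], term: str) -> dict[str, list[int]]:
    """Return the postings for one term, or an empty dict if it never occurs."""
    return dict(index.get(term.lower(), {}))
```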

woah • last Friday at 9:28 PM

My intuition is that since AI assistants are fictional characters in a story being autocompleted by an LLM, mechanisms that are interpretable as human interactions with language and appear in the pretraining data have a surprising advantage over mechanisms that are more like speculation about how the brain works or abstract concepts.

manunamz • yesterday at 3:47 PM

Exactly. Traditional library science truly captured deep patterns of information architecture.

https://x.com/wibomd/status/1818305066303910006

Disney got this right in Ralph Breaks the Internet.

https://x.com/wibomd/status/1827067434794127648

czhu12 • last Friday at 6:56 PM

Similar effort with PageIndex [1], which basically creates a table-of-contents-like tree. Then an LLM traverses the tree to figure out which chunks are relevant for the context in the prompt.

1: https://github.com/VectifyAI/PageIndex
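
A rough sketch of the general idea of traversing a table-of-contents tree, not PageIndex's actual API. `Node`, `select_chunks`, and `keyword_judge` are hypothetical names; `keyword_judge` is a toy stand-in for the LLM call that would decide which sections are worth descending into.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """One table-of-contents entry; leaf nodes carry the chunk text."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def select_chunks(
    node: Node, query: str, is_relevant: Callable[[str, str], bool]
) -> list[str]:
    """Descend only into child sections judged relevant and collect leaf text."""
    if not node.children:
        return [node.text] if node.text else []
    chunks: list[str] = []
    for child in node.children:
        if is_relevant(query, child.title):
            chunks.extend(select_chunks(child, query, is_relevant))
    return chunks


def keyword_judge(query: str, title: str) -> bool:
    """Toy judge (assumption): relevant if the query and title share a word."""
    return bool(set(query.lower().split()) & set(title.lower().split()))
```

Swapping `keyword_judge` for a model call keeps the traversal unchanged, and irrelevant subtrees are never read at all.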

khalic • last Friday at 5:57 PM

This kind of circles back to ontological NLP, which used knowledge representation as a primitive for language processing. There is _a ton_ of work in that direction.

stingraycharles • yesterday at 12:54 PM

Aren’t most successful RAGs using a combination of embedding similarity + BM25 + reranking? I thought there were very few RAGs that only did pure embedding similarity, but I may be mistaken.

siva7 • yesterday at 2:26 AM

> Our documentation was already indexed, chunked, and stored in a Chroma database to power our search, so we built ChromaFs

It's obvious from that sentence that these guys neither understand RAG nor realize that the solution to their agentic problem didn't need any of these further abstractions, including vector search or grep.

rao-v • yesterday at 12:40 AM

I've got to say, people also seem to be missing really simple tricks with RAG that help. Using longer chunks and appending the file path to the chunk makes a big difference.

Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.
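
The chunking trick above can be sketched in a few lines. This is one possible reading, with assumptions: the path is attached as a header line at the top of each chunk, chunks are cut by line count, and `chunk_with_path` and `max_lines` are names invented here.

```python
def chunk_with_path(path: str, text: str, max_lines: int = 80) -> list[str]:
    """Split text into line-based chunks, each labeled with its file path and line range."""
    lines = text.splitlines()
    chunks: list[str] = []
    for start in range(0, len(lines), max_lines):
        end = min(start + max_lines, len(lines))
        header = f"# file: {path} (lines {start + 1}-{end})"
        chunks.append(header + "\n" + "\n".join(lines[start:end]))
    return chunks
```

Because the header is part of the chunk text, whatever embeds or keyword-matches the chunk also sees the directory and file names.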

skeptrune • last Friday at 6:26 PM

I think it's cool that LLMs can effectively do this kind of categorization on the fly at relatively large scale. When you give the LLM tools beyond just "search", it really is effectively cheating.

baby • yesterday at 1:52 AM

Yep, I was using RAG for all sorts of stuff and have now moved everything to just rg+fd+cd+ls: much faster, easier, etc.

_boffin_ • last Friday at 9:41 PM

And next, we’ll get to tag-based file systems.

UltraSane • last Friday at 5:58 PM

Inverted indexes have the major advantage of supporting Boolean operators.
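
To make that concrete: when each term maps to a set of document IDs, AND, OR, and NOT become plain set operations. A minimal sketch, with `build_postings` and `docs_with` as hypothetical helper names:

```python
import re
from collections import defaultdict


def build_postings(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase term to the set of document IDs that contain it."""
    postings: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(doc_id)
    return postings


def docs_with(postings: dict[str, set[str]], term: str) -> set[str]:
    """Return a copy of the posting set for one term (empty if the term is absent)."""
    return set(postings.get(term.lower(), set()))


def search_and(postings: dict[str, set[str]], a: str, b: str) -> set[str]:
    """Documents containing both terms."""
    return docs_with(postings, a) & docs_with(postings, b)


def search_or(postings: dict[str, set[str]], a: str, b: str) -> set[str]:
    """Documents containing either term."""
    return docs_with(postings, a) | docs_with(postings, b)


def search_not(postings: dict[str, set[str]], a: str, b: str) -> set[str]:
    """Documents containing the first term but not the second."""
    return docs_with(postings, a) - docs_with(postings, b)
```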

risyachka • yesterday at 11:03 AM

More and more often you see "new discoveries" that are very old concepts. The only discovery that usually happens is that the author discovers the concept for himself. But it is essential nowadays to post it as if you had discovered something new.

whattheheckheck • last Friday at 5:55 PM

Turns out the millions of people in knowledge work aren't librarians and they wing shit everywhere.
