Hacker News

softwaredoug | last Friday at 5:41 PM | 17 replies

The real thing I think people are rediscovering with file system based search is that there’s a type of semantic search that’s not embedding based retrieval. One that looks more like how a librarian organizes files into shelves based on the domain.

We’re rediscovering forms of search we’ve known about for decades. And it turns out they’re more interpretable to agents.

https://softwaredoug.com/blog/2026/01/08/semantic-search-wit...


Replies

neuzhou | last Friday at 10:24 PM

Agreed. I've been working on a codebase with 400+ Python files and the difference is stark. With embedding-based RAG, the agent kept pulling irrelevant code snippets that happened to share vocabulary. Switched to just letting the agent browse the directory tree and read files on demand -- it figured out the module structure in about 30 seconds and started asking for the right files by path.

The directory hierarchy is already a human-curated knowledge graph. We just forgot that because we got excited about vector math.
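
For anyone who wants to try the "browse the tree, read on demand" setup, here is a minimal sketch of the two tools an agent would need. The names `render_tree` and `read_file`, the depth limit, and the character cap are all hypothetical choices for illustration, not anything from the setup described above.

```python
from pathlib import Path


def render_tree(root, max_depth=3):
    """Return an indented listing of root, directories first, up to max_depth levels."""
    root = Path(root).resolve()
    lines = [f"{root.name}/"]

    def walk(directory, depth):
        if depth > max_depth:
            return
        # Sort directories before files, then alphabetically.
        entries = sorted(directory.iterdir(), key=lambda p: (p.is_file(), p.name))
        for entry in entries:
            indent = "  " * depth
            if entry.is_dir():
                lines.append(f"{indent}{entry.name}/")
                walk(entry, depth + 1)
            else:
                lines.append(f"{indent}{entry.name}")

    walk(root, 1)
    return "\n".join(lines)


def read_file(root, rel_path, max_chars=4000):
    """Return the start of one file, refusing paths that escape root."""
    root = Path(root).resolve()
    full = (root / rel_path).resolve()
    if root not in full.parents:
        raise ValueError("path escapes the project root")
    return full.read_text(encoding="utf-8", errors="replace")[:max_chars]
```

The agent gets the output of `render_tree` once, then asks for files by relative path through `read_file`. The path check is there so a tool call like `../secrets` cannot leave the project.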

wielebny | last Friday at 5:56 PM

Someone simply assumed at some point that RAG must be based on vector search, and everyone followed.

andai | yesterday at 11:43 AM

I spent a while working on a retrieval system for LLMs and ended up reinventing a concordance (which is like an index).

It's basically the same thing as Google's inverted index, which is how Google search works.

Nothing new under the sun :)
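
To make "concordance" concrete, a minimal sketch: map every word to the places it occurs. The function name, the tokenizer regex, and the (doc_id, position) layout are assumptions for illustration, not the actual system described above.

```python
import re
from collections import defaultdict


def build_concordance(docs):
    """Map each lowercased word to the (doc_id, word_position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].append((doc_id, position))
    return dict(index)
```

Looking up a word is then a single dictionary access, and the stored positions let you pull the surrounding words back out as context.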

woah | last Friday at 9:28 PM

My intuition is that since AI assistants are fictional characters in a story being autocompleted by an LLM, mechanisms that are interpretable as human interactions with language and appear in the pretraining data have a surprising advantage over mechanisms that are more like speculation about how the brain works or abstract concepts.

manunamz | yesterday at 3:47 PM

Exactly. Traditional library science truly captured deep patterns of information architecture.

https://x.com/wibomd/status/1818305066303910006

Disney got this right in Ralph Breaks the Internet.

https://x.com/wibomd/status/1827067434794127648

czhu12 | last Friday at 6:56 PM

Similar effort with PageIndex [1], which basically creates a table-of-contents-like tree. Then an LLM traverses the tree to figure out which chunks are relevant for the context in the prompt.

1: https://github.com/VectifyAI/PageIndex
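
A rough sketch of the general idea, not PageIndex's actual API: a table-of-contents tree where a chooser decides which branches to descend into. `Node`, `find_relevant`, and `keyword_chooser` are hypothetical names, and the keyword chooser is only a toy stand-in for the LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    text: str = ""
    children: list = field(default_factory=list)


def keyword_chooser(query, children):
    """Toy stand-in for the LLM: keep children whose title shares a word with the query."""
    words = set(query.lower().split())
    return [c for c in children if words & set(c.title.lower().split())]


def find_relevant(node, query, choose=keyword_chooser):
    """Descend only into chosen children and return the leaf nodes reached."""
    if not node.children:
        return [node]
    hits = []
    for child in choose(query, node.children):
        hits.extend(find_relevant(child, query, choose))
    return hits
```

In a real setup `choose` would show the child titles (and maybe summaries) to a model and parse its picks. The point is that only titles are read on the way down, and full text is read only at the leaves.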

khalic | last Friday at 5:57 PM

This kind of circles back to ontological NLP, which used knowledge representation as a primitive for language processing. There is _a ton_ of work in that direction.

stingraycharles | yesterday at 12:54 PM

Aren’t most successful RAGs using a combination of embedding similarity + BM25 + reranking? I thought there were very few RAGs that only did pure embedding similarity, but I may be mistaken.
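
For reference, one common way to combine an embedding ranking with a BM25 ranking is reciprocal rank fusion (RRF). This is a sketch of that one technique, assuming you already have the two ranked lists of document IDs; it is not a claim about how any particular RAG stack does it, and the reranking step is left out.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Usage would look like `reciprocal_rank_fusion([embedding_ranking, bm25_ranking])`, with the top few fused results then handed to a reranker. `k=60` is a conventional default, not a tuned value.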

siva7 | yesterday at 2:26 AM

> Our documentation was already indexed, chunked, and stored in a Chroma database to power our search, so we built ChromaFs

It's obvious from that sentence that these guys neither understand RAG nor realized that the solution to their agentic problem didn't need any of these further abstractions, including vector search or grep.

rao-v | yesterday at 12:40 AM

I've got to say people also seem to be missing really simple tricks with RAG that help. Using longer chunks and appending the file path to the chunk makes a big difference.

Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.
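
The chunking trick mentioned above can be sketched like this. The function name, the 80-line chunk size, and putting the path in a header line at the top of each chunk (rather than at the end) are all assumptions for illustration.

```python
def chunk_with_path(path, text, lines_per_chunk=80):
    """Split text into line-based chunks, each labeled with its file path and line range."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), lines_per_chunk):
        end = min(start + lines_per_chunk, len(lines))
        body = "\n".join(lines[start:end])
        chunks.append(f"# file: {path} (lines {start + 1}-{end})\n{body}")
    return chunks
```

The header is embedded and retrieved along with the chunk, so a match on `src/auth/login.py` carries its location with it even when the code itself shares vocabulary with unrelated files.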

skeptrune | last Friday at 6:26 PM

I think it's cool that LLMs can effectively do this kind of categorization on the fly at relatively large scale. When you give the LLM tools beyond just "search", it really is effectively cheating.

baby | yesterday at 1:52 AM

Yep, I was using RAG for all sorts of stuff and now moved everything to just rg+fd+cd+ls, much faster, easier, etc.

_boffin_ | last Friday at 9:41 PM

And next, we’ll get to tag based file systems

UltraSane | last Friday at 5:58 PM

Inverted indexes have the major advantage of supporting Boolean operators.
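
A small sketch of why that falls out naturally: if each term maps to a set of document IDs, then AND, OR, and NOT are just set operations. `build_index` and `docs_with` are hypothetical helper names, and the tokenizer is a simple assumption.

```python
import re


def build_index(docs):
    """Map each lowercased term to the set of doc IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(term, set()).add(doc_id)
    return index


def docs_with(index, term):
    """Return the posting set for a term (empty if the term is unknown)."""
    return index.get(term.lower(), set())
```

With that in place, `docs_with(index, "search") & docs_with(index, "grep")` is an AND query, `|` is OR, and `-` gives AND NOT.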

risyachka | yesterday at 11:03 AM

More and more often you see "new discoveries" that are very old concepts. The only discovery that usually happens there is that the author discovers the concept for himself. But it is essential nowadays to post it as if you discovered something new.

whattheheckheck | last Friday at 5:55 PM

Turns out the millions of people in knowledge work aren't librarians and they wing shit everywhere.