Hacker News

tgv · last Friday at 6:05 PM · 2 replies

It's a rather old-fashioned style of tokenization. In the 1980s this was common, I think. But, as noted in another comment, it doesn't work that well for languages with richer morphology or compounding. It's a very "English" approach.
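
To illustrate, here's roughly what that style of tokenizer looks like (a minimal sketch in Python; the regex and example words are my own, not from any particular system):

    import re

    def classic_tokenize(text):
        # 1980s-style: lowercase, then split on anything
        # that isn't a letter
        return re.findall(r"[a-z]+", text.lower())

    # Fine for English:
    print(classic_tokenize("The cat sat on the mat."))
    # ['the', 'cat', 'sat', 'on', 'the', 'mat']

    # But a German compound stays one opaque token, with no link
    # to its parts "Kranken" (sick) and "Haus" (house):
    print(classic_tokenize("Krankenhaus"))
    # ['krankenhaus']

Note that even the character class is English-only: [a-z] silently drops accented letters.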


Replies

empiko · last Friday at 7:43 PM

This was common even in 2015. You can still see people removing stop words from text, even when they feed it to LLMs. It's of course terrible for performance, but old habits die hard, I guess.
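
For reference, the habit in question looks something like this (a sketch; the stop-word list is a tiny illustrative subset):

    STOP_WORDS = {"the", "a", "an", "is", "of", "to", "in"}  # illustrative subset

    def remove_stop_words(text):
        # Classic IR preprocessing: drop high-frequency function words
        return " ".join(w for w in text.split()
                        if w.lower() not in STOP_WORDS)

    # An LLM can make good use of exactly the words this throws away:
    print(remove_stop_words("The key is in the drawer"))  # 'key drawer'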

jamesgresql · last Friday at 6:31 PM

Chinese, Japanese, Korean, etc. don’t work like this either.
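
For example, whitespace splitting gets you nowhere on Chinese, where words aren’t space-delimited (a quick sketch; the sentence is my own example):

    # "I like natural language processing" in Chinese: no spaces
    sentence = "我喜欢自然语言处理"

    # Whitespace tokenization returns the whole sentence as one "token":
    print(sentence.split())  # ['我喜欢自然语言处理']

    # Recovering word boundaries (roughly 我 / 喜欢 / 自然语言 / 处理)
    # needs a dictionary- or model-based segmenter such as jieba.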

However, even though the approach is “old-fashioned”, it’s still widely used for English. I’m not sure there’s a universal tokenization approach for semantic search that would be both fast and accurate.

At the end of the day, people choose a tokenizer that matches their language.

I will update the article to make all this clearer though!