Chinese, Japanese, Korean, etc. don’t work like this either.
However, even though the approach is “old-fashioned”, it’s still widely used for English. I’m not sure there is a universal tokenization approach that semantic search could use that would be both fast and accurate.
At the end of the day, people choose a tokenizer that matches their language.
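For example, here is a minimal sketch (not from the article) of why the choice matters. `jieba` is one popular Chinese segmenter, used here purely for illustration:

```python
import jieba  # pip install jieba

english = "tokenizers split text into words"
chinese = "我爱自然语言处理"  # "I love natural language processing"

# Whitespace splitting works fine for English...
print(english.split())
# ['tokenizers', 'split', 'text', 'into', 'words']

# ...but Chinese has no spaces between words, so it falls apart:
print(chinese.split())
# ['我爱自然语言处理']  <- one giant "token"

# A language-aware segmenter recovers word boundaries:
print(jieba.lcut(chinese))
# e.g. ['我', '爱', '自然语言', '处理'] (exact split depends on jieba's dictionary)
```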
I will update the article to make all this clearer though!