wongarsu • last Friday at 12:44 PM • 2 replies • view on HN

Notably, this is tokenization for traditional search. LLMs use very different tokenization with very different goals.

Replies

It's a rather old-fashioned style of tokenization. In the 1980s this was common, I think. But, as noted in another comment, it doesn't work that well for languages with a richer morphology, or compounding. It's a very "English" approach.
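
To illustrate the compounding point, here is a minimal sketch. It assumes "old-fashioned" means roughly "lowercase, then split on non-word characters"; the thread doesn't spell out the exact pipeline, and `simple_tokenize` and `matches` are hypothetical names made up for this example.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split on runs of non-word characters.

    Assumption: this stands in for the whitespace/punctuation-based
    style of tokenizer discussed above; it is not any specific engine.
    """
    return [t for t in re.split(r"[^\w]+", text.lower()) if t]

def matches(query, document):
    """True if every query token appears as a whole token in the document."""
    doc_tokens = set(simple_tokenize(document))
    return all(q in doc_tokens for q in simple_tokenize(query))

# English writes the compound as separate words, so the token lines up.
english_hit = matches("insurance", "car insurance company")

# German writes it as one word, so "versicherung" never becomes its own token.
german_hit = matches("versicherung", "Kraftfahrzeugversicherung")

print(english_hit, german_hit)
```

The English query matches and the German one doesn't, even though both documents are about insurance. Handling that would take something extra, such as a decompounding step, which this sketch deliberately leaves out.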

jamesgresql • last Friday at 6:25 PM

100%, maybe we should do a follow-up on other types of tokenization.
