Hacker News

wongarsu last Friday at 12:44 PM

Notably, this is tokenization for traditional search. LLMs use very different tokenization with very different goals.


Replies

tgv last Friday at 6:05 PM

It's a rather old-fashioned style of tokenization. In the 1980s this was common, I think. But, as noted in another comment, it doesn't work that well for languages with a richer morphology, or compounding. It's a very "English" approach.
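
A minimal sketch of the compounding problem, assuming a tokenizer that only lowercases and splits on non-word characters. The helper names `simple_search_tokens` and `matches` are hypothetical and not taken from any particular search library:

```python
import re

def simple_search_tokens(text):
    """Lowercase the text and split on runs of non-word characters."""
    return [t for t in re.split(r"[^\w]+", text.lower()) if t]

def matches(query, document):
    """True if every query token appears as a whole token in the document."""
    doc_tokens = set(simple_search_tokens(document))
    return all(t in doc_tokens for t in simple_search_tokens(query))

english = "steam ship travel on the Danube"
german = "Donaudampfschifffahrt"  # one compound word, so one token

print(matches("ship", english))        # query word is its own token
print(matches("schifffahrt", german))  # query word is buried inside the compound
```

The English query finds its word because whitespace already separates it, while the German query misses because the compound is never split. Handling that case would need something extra, such as a decompounding step or subword matching.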

jamesgresql last Friday at 6:25 PM

100%, maybe we should do a follow-up on other types of tokenization.