Hacker News

rldjbpin, today at 9:26 AM

may not be the most efficient way to go about things, but there remains a seemingly obvious use case for non-latin languages to do things from scratch.

see sarvam.ai and their tokenisation improvements on local languages [1]. not every llm needs to help with coding, nor does it need to become a Babel fish.

language is culture, so i can see the motivation behind their initiative. it must be nice to afford to do this yourself.

[1] https://www.sarvam.ai/blogs/sarvam-30b-105b


Replies

kgeist, today at 9:55 AM

>but there remains a seemingly obvious use case for non-latin languages to do things from scratch

>see sarvam.ai and their tokenisation improvements on local languages

You don't need to build from scratch to improve tokenization, though.

Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen's tokenizer to include 5 times more Cyrillic tokens (+ post-training on a Russian corpus).
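
To illustrate why more Cyrillic tokens would speed up generation: an autoregressive model emits one token per decoding step, so fewer tokens for the same text means fewer steps. Below is a toy sketch, not T-Bank's actual method or Qwen's real tokenizer. The greedy matcher, the vocabularies, and the sample text are all made up for illustration, and the base vocabulary is assumed to have no Cyrillic tokens at all, which exaggerates the effect.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback."""
    max_len = max((len(t) for t in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary match: emit one token per UTF-8 byte.
            # A Cyrillic letter is 2 bytes, so it costs 2 tokens here.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


text = "привет мир"

# Hypothetical vocabularies, chosen only to make the effect visible.
base_vocab = {" "}
extended_vocab = base_vocab | {"при", "вет", "мир"}

base_tokens = tokenize(text, base_vocab)
extended_tokens = tokenize(text, extended_vocab)

print(len(base_tokens), len(extended_tokens))
print(f"decoding steps saved: {len(base_tokens) / len(extended_tokens):.2f}x")
```

In this toy setup the same string drops from 19 tokens to 4. Real gains are smaller because a stock tokenizer already covers some Cyrillic, and the new token embeddings still have to be trained, which is where the post-training on a Russian corpus comes in.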