kgeist • today at 9:55 AM

>but there remains a seemingly obvious use case for non-latin languages to do things from scratch

>see sarvam.ai and their tokenisation improvements on local languages

You don't need to build from scratch to improve tokenization, though.

Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen's tokenizer to include 5 times more Cyrillic tokens (+ post-training on a Russian corpus).
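To illustrate why a richer vocabulary speeds up generation, here is a toy sketch (not T-Bank's or Qwen's actual code). An autoregressive model emits one token per decoding step, so fewer tokens for the same text means fewer steps. The vocabularies and the `greedy_tokenize` helper are hypothetical, and real tokenizers use BPE with byte fallback rather than this simple longest-match scheme.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization; unknown characters become single-char tokens."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies for illustration only.
base_vocab = {"the", "cat", " "}                  # no Cyrillic entries
extended_vocab = base_vocab | {"привет", "мир"}   # Cyrillic tokens added

text = "привет мир"
before = greedy_tokenize(text, base_vocab)
after = greedy_tokenize(text, extended_vocab)

# Fewer tokens for the same text means fewer decoding steps.
print(len(before), len(after))
```

The new tokens start with untrained embeddings, which is presumably why the post-training step on a Russian corpus is needed.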

Replies

rldjbpin • today at 11:45 AM

sarvam's improvements were in the number of tokens used to represent words in non-english languages compared to english.
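a rough sketch of why that gap exists, assuming the worst case where a tokenizer has no merges for a script and falls back to one token per UTF-8 byte. the sample sentences and the `bytes_per_word` helper are my own illustration, not sarvam's measurements.

```python
def bytes_per_word(sentence):
    """Average UTF-8 bytes per whitespace-separated word.

    Under a pure byte-fallback tokenizer this equals tokens per word.
    """
    words = sentence.split()
    return sum(len(w.encode("utf-8")) for w in words) / len(words)


# Illustrative samples chosen for this sketch.
english = "hello world"
hindi = "नमस्ते दुनिया"

english_ratio = bytes_per_word(english)
hindi_ratio = bytes_per_word(hindi)

# Devanagari code points take 3 bytes each in UTF-8, ASCII letters take 1.
print(english_ratio, hindi_ratio)
```

real tokenizers do better than pure byte fallback, but the same skew shows up whenever the vocabulary was learned mostly from latin-script text.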

the great thing about the current momentum is that someone can test this hypothesis by applying the T-Bank approach to the same set of languages and comparing outcomes.

unfortunately, not everyone has that level of compute so readily available, at least not those outside the ZIRP/VC ecosystem of the valley.
