may not be the most efficient way to go about things, but there remains a seemingly obvious use case for non-latin languages to do things from scratch.
see sarvam.ai and their tokenisation improvements on local languages [1]. not every llm needs to help with coding, nor does it need to be a Babel fish.
language is culture, so i can see the motivation behind their initiative. it must be nice to be able to afford to do this yourself.
>but there remains a seemingly obvious use case for non-latin languages to do things from scratch
>see sarvam.ai and their tokenisation improvements on local languages
You don't need to build from scratch to improve tokenization, though.
Russia's T-Bank increased generation speed by 1.5-3x by changing a stock Qwen's tokenizer to include 5 times more Cyrillic tokens, followed by post-training on a Russian corpus.
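To illustrate why a bigger Cyrillic vocabulary speeds things up, here is a toy sketch. It is not T-Bank's pipeline and not real BPE: it assumes a greedy longest-match tokenizer with UTF-8 byte fallback, and the names `tokenize`, `base_vocab` and `extended_vocab` are made up for the example. The point is only that text covered by whole-word tokens needs far fewer tokens, and fewer tokens means fewer decoding steps for the same sentence.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback.

    Hypothetical helper for illustration only; real tokenizers (BPE,
    Unigram) pick their pieces differently.
    """
    max_len = max((len(t) for t in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # Nothing matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Assumed vocabularies: a "stock" one with no Cyrillic, and an extended one.
base_vocab = {"hello", " ", "world"}
extended_vocab = base_vocab | {"привет", "мир"}

text = "привет мир"
base_tokens = tokenize(text, base_vocab)
ext_tokens = tokenize(text, extended_vocab)

print(len(base_tokens), "tokens with the stock vocab")     # 19
print(len(ext_tokens), "tokens with the extended vocab")   # 3
```

In a real model the new tokens also need embeddings, which is presumably why the post-training step is required; the sketch above only covers the token-count side.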