>but there remains a seemingly obvious use case for non-latin languages to do things from scratch
>see sarvam.ai and their tokenisation improvements on local languages
You don't need to build from scratch to improve tokenization, though.
Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen model's tokenizer to include 5 times more Cyrillic tokens (plus post-training on a Russian corpus).
the improvement for sarvam was in the number of tokens needed to represent words in non-english languages compared to english.
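To make the tokens-per-word point concrete, here is a toy sketch. It is not Sarvam's or T-Bank's actual method: the greedy longest-match tokenizer, the tiny vocabularies, and the names `greedy_tokenize` and `tokens_per_word` are all made up for illustration. It only shows why a vocabulary with no entries for a script falls back to bytes, and how adding a few tokens for that script brings the count down.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match tokenization of a single word (toy, hypothetical).

    Characters not covered by the vocabulary fall back to one token per
    UTF-8 byte, which is roughly what byte-level tokenizers do.
    """
    tokens = []
    max_len = max(len(t) for t in vocab)
    i = 0
    while i < len(word):
        for n in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # No vocabulary entry matched: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in word[i].encode("utf-8"))
            i += 1
    return tokens


def tokens_per_word(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


# Toy vocabularies (assumptions, not taken from any real model).
base_vocab = {"the", "cat", "sleep", "s"}
extended_vocab = base_vocab | {"кот", "спит"}  # a few Cyrillic tokens added

english = "the cat sleeps"
russian = "кот спит"

print(f"{tokens_per_word(english, base_vocab):.2f}")      # 1.33
print(f"{tokens_per_word(russian, base_vocab):.2f}")      # 7.00
print(f"{tokens_per_word(russian, extended_vocab):.2f}")  # 1.00
```

Fewer tokens per word means fewer decoding steps for the same text, which is where a generation speedup would come from. A real model also needs the new embeddings trained, hence the post-training step mentioned above.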
the great thing about the current momentum is that someone can test this hypothesis by applying the T-Bank approach to the same set of languages and comparing outcomes.
unfortunately, not everyone has that level of compute so easily available, at least not those outside the ZIRP/VC ecosystem of the valley.