You can look at the Ukrainian LLM Lapa for inspiration:

oddmiral • yesterday at 6:57 AM

https://huggingface.co/spaces/lapa-llm/lapa

Best tokenizer for the Ukrainian language

Thanks to a state-of-the-art tokenizer adaptation method developed by Mykola Haltiuk as part of this project, 80,000 of the model's 250,000 tokens were replaced with Ukrainian ones without loss of model quality, which makes Lapa LLM the fastest model for working with the Ukrainian language. Compared to the original Gemma 3, the model needs 1.5 times fewer tokens for Ukrainian text and so performs three times fewer computations to achieve better results.
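If you want to check a "1.5 times fewer tokens" style claim on your own text, a rough sketch follows. It only counts tokens from two tokenizers over the same corpus and divides. The names `token_ratio`, `toy_baseline` and `toy_adapted` are made up for illustration, and the toy tokenizers are stand-ins that chop words into fixed-size chunks; they are not the Gemma 3 or Lapa tokenizers.

```python
from typing import Callable, Iterable, List

# A tokenizer here is just any function from text to a list of tokens.
Tokenize = Callable[[str], List[str]]


def token_ratio(texts: Iterable[str], baseline: Tokenize, adapted: Tokenize) -> float:
    """Return (baseline token count) / (adapted token count) over a corpus.

    A value of 1.5 means the adapted tokenizer needs 1.5 times fewer tokens.
    """
    texts = list(texts)
    baseline_total = sum(len(baseline(t)) for t in texts)
    adapted_total = sum(len(adapted(t)) for t in texts)
    if adapted_total == 0:
        raise ValueError("adapted tokenizer produced no tokens")
    return baseline_total / adapted_total


def toy_baseline(text: str) -> List[str]:
    # Hypothetical stand-in: split every word into 2-character chunks.
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


def toy_adapted(text: str) -> List[str]:
    # Hypothetical stand-in: split every word into 3-character chunks,
    # mimicking a vocabulary with longer Ukrainian tokens.
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


if __name__ == "__main__":
    sample = ["привіт", "велике місто"]
    print(token_ratio(sample, toy_baseline, toy_adapted))
```

With real models you would pass something like `tokenizer.tokenize` from two Hugging Face tokenizers instead of the toy functions. Note this measures the token ratio only; how that translates into compute savings depends on the model and sequence lengths.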
