I've been thinking about sovereign AI a lot lately. About a year ago I was wondering what each country would be doing, looking at places like Australia (which has pretty strict data residency laws for certain industries). At that point I considered making the case that such countries should train their own models, but now I'm having a harder time justifying that.
I can't see how any of these other countries could even approach the level of capability of the big three providers. I can imagine only a handful of countries that could even theoretically put enough resources towards reaching the SOTA frontier. Sure, even a model at roughly 2024-level capability has plenty of valid use cases today, but I'm concerned that people will just go with the big three because what they offer is still so, so much better.
Not trying to discourage efforts like these, but is there really a good case for working on them? Or perhaps there's a state/national case, but it's harder for me to see a real business case.
India has a lot of languages, and people need access to something that allows them to do basic stuff in those languages. I don't think relying on the US is a long-term solution.
An example: I am into proofreading and language learning, and I am forced to rely on Claude/Gemini to extract text from old books because of the lack of good Indian models. I started with regular Tesseract, but its accuracy outside of the Latin alphabet is not that great. Qwen 3/3.5 is good with the Bombay style of Devanagari but craps the bed with the Calcutta style. And neither is great with languages like Bengali. In contrast, Claude can extract Bengali text from terrible scans and old printing with something like 99+ percent accuracy.
Models specifically targeted at Indian languages and content will perform better within that context, I feel.