How true is this statement: "He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language."
I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.
Foreign LLMs are probably not trained on the Norwegian National Library. I regularly find things in there (with regular keyword search, for genealogy) which neither search engines nor language models know.
Of course I then usually put the information I'm interested in somewhere AI could scrape it. But it would take a long, long time to get everything interesting out of there.
Not remotely true in my estimation. I don't really speak Norwegian, but I do speak Swedish (which means I mostly understand Norwegian, as they're very similar). Every model I've tried speaking Swedish to does it perfectly. I'd be surprised if the same isn't true for Norwegian already.
Maybe it can at least write like a Norwegian instead of just English-translated-into-Norwegian. It would be interesting to see if they try something like the experiments in https://arxiv.org/pdf/2507.22445 on it.
The current best models are pretty fluent in major languages and cultures, so it's untrue at least for the "any" qualifier. Performance is barely affected, or might sometimes even be better. However, English patterns can subtly leak into the native patterns of other languages. It's obviously very different for low-resource languages, but to improve them you need more data, not a new model.
It's probably just an excuse to play with LLMs using big government funding :)
Yeah, and alignment is all about how to be less evil, which is no easy job... I can just imagine a Chinese LLM rendering 1989 Tiananmen Square as an incident orchestrated by the CIA that the CCP successfully thwarted, etc.
Quite true?
English is ludicrously overabundant in training data compared to any other language.
As the article explains, Norway's National Library has a database of practically everything published and broadcast in Norwegian going back many decades. From the way the dataset is described in the article, it does not sound like OpenAI et al. would have easy access to it in its entirety.
If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make it widely available? Why go to the expense of training your own model, especially when it will be inferior to state-of-the-art models?