How true is this statement: "He asserted that any country with its own language that did not ha...

KeplerBoy • yesterday at 10:11 PM • 9 replies

How true is this statement: "He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language."

I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.

Replies

WatchDog • yesterday at 10:42 PM

If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make it widely available? Why go to the expense of training your own model, especially when it will be inferior to state-of-the-art models?

vintermann • today at 4:15 AM

Foreign LLMs are probably not trained on the Norwegian National Library. I regularly find things in there (with regular keyword search, for genealogy) which neither search engines nor language models know.

Of course I then usually put the information I'm interested in somewhere AI could scrape it. But it would take a long, long time to get everything interesting out of there.

amarant • today at 12:11 AM

Not remotely true in my estimation. I don't really speak Norwegian, but I do speak Swedish (which means I mostly understand Norwegian, as they're very similar). Every model I've tried speaking Swedish to does it perfectly. I'd be surprised if the same isn't true for Norwegian already.

internet_points • today at 9:22 AM

Maybe it can at least write like a Norwegian instead of just English-translated-into-Norwegian. It would be interesting to see if they try something like the experiments in https://arxiv.org/pdf/2507.22445 on it.

orbital-decay • yesterday at 11:55 PM

Current-best models are pretty fluent in major languages and cultures, so it's untrue at least for the "any" qualifier. Performance is barely affected or might even be better sometimes. However, English patterns can subtly leak into the native patterns of other languages. It's obviously very different for low-resource languages, but to improve them you need more data, not a new model.

amelius • today at 8:25 AM

It's probably just an excuse to play with LLMs using big government funding :)

alliao • today at 12:39 AM

Yeah, and alignment is all about how to be less evil, which is no easy job... I can just imagine a Chinese LLM rendering 1989 Tiananmen Square as an incident orchestrated by the CIA which the CCP successfully thwarted, etc.

intended • today at 4:58 AM

Quite true?

English is ludicrously overabundant in training data compared to any other language.

DiogenesKynikos • today at 4:01 AM

As the article explains, Norway's National Library has a database of practically everything published and broadcast in Norwegian going back many decades. From the way the dataset is described in the article, it does not sound like OpenAI et al. would have easy access to it in its entirety.
