logoalt Hacker News

KeplerBoyyesterday at 10:11 PM9 repliesview on HN

How true is this statement: "He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language."

I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.


Replies

WatchDogyesterday at 10:42 PM

If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available? Why go to the expense of training your own model, especially when it will be inferior to state of the art models.

show 9 replies
vintermanntoday at 4:15 AM

Foreign LLMs are probably not trained on the Norwegian National Library. I regularly find things in there (with regular keyword search, for genealogy) which neither search engines or language models know.

Of course I then usually put the information I'm interested in somewhere AI could scrape it. But it would take a long, long time to get everything interesting out of there.

show 1 reply
amaranttoday at 12:11 AM

Not remotely true in my estimation. I don't really speak Norwegian, but I do speak Swedish(which means I mostly understand Norwegian as they're very similar). Every model I've tried speaking Swedish to does it perfectly. I'd be surprised if the same isn't true for Norwegian already

show 5 replies
internet_pointstoday at 9:22 AM

Maybe it can at least write like a Norwegian instead of just English-translated-into-Norwegian. It would be interesting to see if they try something like the experiments in https://arxiv.org/pdf/2507.22445 on it.

orbital-decayyesterday at 11:55 PM

Current-best models are pretty fluent at major languages and cultures, so it's untrue at least for the "any" qualifier. Performance is barely affected or might be even better sometimes. However English patterns can subtly leak into native patterns of other languages. It's obviously very different for low-resource languages, but to improve them you need more data, not a new model.

show 1 reply
ameliustoday at 8:25 AM

It's probably just an excuse to play with LLMs using big government funding :)

alliaotoday at 12:39 AM

yeah and alignment is all about how to be less evil which is no easy job... I can just imagine Chinese LLM renders 1989 tianmen square as an incident orchestrated by CIA which CCP successfully thwarted etc etc

intendedtoday at 4:58 AM

Quite true ?

English is ludicrously over abundant in training when compared to any language.

show 1 reply
DiogenesKynikostoday at 4:01 AM

As the article explains, Norway's National Library has a database of practically everything published and broadcast in Norwegian going back many decades. From the way the dataset described in the article, it does not sound like OpenAI et al. would have easy access to it in its entirety.