Hacker News

rafram today at 2:54 AM, 7 replies

> Marius Husnes, the Head of IT Platform at the library (Nasjonalbiblioteket) discussed the project at Huawei’s ID Forum 2026 in Paris, saying that no commercial LLM provider was developing a local (Norwegian) language LLM. He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.

I am not overly confident that Marius Husnes knows what he’s talking about here.


Replies

fnordpiglet today at 4:30 AM

He’s right, though it’s not entirely about the training corpus. It’s also about the tokenizer, which encodes text more compactly for the language it is biased towards. English-oriented LLMs are more capable in English than in other languages partly because English text takes fewer tokens. Try any online tokenizer that calls Anthropic’s API: common English words are typically a single token, while Norwegian words often come out as 2-4 tokens, sometimes more. Some languages, like Thai, are at a huge disadvantage. Likewise, corpus selection is often heavily skewed towards the target language, simply because more effort goes into sourcing written works in that language. There will also be semantic biases in the vector space, because semantically similar embeddings from different languages influence each other, which produces a baseline that differs from the language’s own cultural one. Finally, fine-tuning greatly affects cultural expression in the LLM. None of these are trivial effects.

There are a lot of efforts to create LLMs for dying languages, and others that use cross-cultural models to boost them, but if your language has a well-established body of written work, there’s a good reason to build a heritage LLM specific to your language and culture. Expecting OpenAI or Anthropic to prioritize your language over their target audience when a tradeoff has to be made is absurd.
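
A rough sketch of the tokenizer point, in case it helps. It assumes a toy character-level byte-pair encoder trained only on a handful of English words; the corpus, the function names (`train_bpe`, `merge_pair`, `tokenize`) and the merge budget are all made up for illustration. Production tokenizers are byte-level and trained on multilingual data, so the skew is a matter of degree rather than the all-or-nothing split this toy produces.

```python
from collections import Counter

# Hypothetical, tiny English-only training text (an assumption for this demo).
ENGLISH_TEXT = (
    "the library keeps the history and the culture of the country "
    "the language of the library is the language of the country "
    "history and culture"
)

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges by repeatedly joining the most frequent pair."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every training word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[tuple(merge_pair(symbols, best))] += freq
        words = new_words
    return merges

def tokenize(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Merge budget large enough that every training word collapses to one token.
merges = train_bpe(ENGLISH_TEXT, 200)

for word in ["library", "history", "biblioteket", "historie"]:
    pieces = tokenize(word, merges)
    print(word, len(pieces), pieces)
```

The English words were in the training text, so each comes out as one token. The Norwegian words were not, so they can only be assembled from fragments the encoder happened to learn from English, and they always take more than one token.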

chvid today at 4:12 AM

When I chat with ChatGPT, it is fairly obvious that it is American: its native language, its style, and its attitude are American, even if we chat in Danish.

Just as we cannot rely on Netflix and HBO to produce Scandinavian TV shows, even though they might do so at the moment, we need to make our own stuff in this area too.

And over time, the technology to do this will become cheap and readily available.

isawczuk today at 5:24 AM

Poland has its own LLM called Bielik. It's not only better at preserving Polish-sounding wording, it's also better at writing government documents. Why better? They ran an arena comparison, and statistically it's just better.

KaiserPro today at 7:10 AM

Could you provide evidence to suggest he is wrong?

It seems like you've made an assertion but not provided evidence. Why is it not a disadvantage to only have English LLMs?

Can you get the nuance of Norwegian history/culture with present models?

spiderfarmer today at 3:10 AM

It sounds plausible enough to get subsidies.


idiotsecant today at 3:36 AM

You're making the mistake of thinking that whether he knows what he is talking about matters. He is brewing a potion. Its ingredients are a trendy term, a vaguely spooky threat and a clear, overly simplistic solution that of course he will graciously assume control of, for the good of the motherland.

This potion is potent and you'd think it would stop working from frequent misuse but you'd be wrong!
