Hacker News

WatchDog yesterday at 10:42 PM, 10 replies

If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available? Why go to the expense of training your own model, especially when it will be inferior to state-of-the-art models?


Replies

black_puppydog yesterday at 11:58 PM

I task GPT/Claude with researching stuff that pertains to very specific cultural or legal aspects of French politics, on a daily basis. Even though French is a way more common language globally than Norwegian, these models still haven't figured out that, no matter which language I speak to them in (German or English, depending on my mood), their web searches need to be done in French to return reasonable results. I have to remind them every time, lest they come back with "uh, didn't find anything relevant, here, take some hallucinations instead."

So, given the Anglocentrism of current models, my confidence in American providers giving any shits about non-American users/use-cases is pretty low, and it gets lower the smaller the language community is.

a2128 yesterday at 11:45 PM

What incentives does OpenAI have to make sure the AI actually works well with Norwegian beyond capturing a (small) Norwegian market? What incentives do they have to take Norwegian values into consideration, or to preserve Norwegian culture into the future? The matter is also a question of national sovereignty, so to simply release the data and nicely ask foreign companies to solve the problem for you would be a fool's move.

embedding-shape yesterday at 11:23 PM

Yeah, I was about to comment that too. Instead of training a new model and new weights exclusively for Norwegian (and expecting/wanting every other small/medium-sized country to do the same), which seems infinitely harder, they could have made high-quality transcriptions and English translations of the stories currently described only in Norwegian, and made it all public. I guess there would still be a worry that it'd be counted as "less important" compared to history, news and culture about other countries.

onion2k today at 9:19 AM

> wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available?

Only if you believe other people will value that enough to expend the effort necessary to use it. If you believe other people will see it as low value and ignore it then you'd be better off doing the training yourself in order to guarantee it happens.

There's also a secondary benefit that your team doing the work will learn some useful skills while they do it.

electroglyph yesterday at 11:16 PM

absolutely. somebody online was wanting an LLM with Georgian language support, and that's exactly what i suggested: start digitizing Georgian text.

blks today at 7:01 AM

Because state of the art models are owned and controlled by foreign agents.

gizajob today at 8:37 AM

Because you have so much money you don’t know what to do with it any more.

vintermann today at 4:21 AM

Permissions, probably. Copyrights and statutes. Knowing the librarians, unfortunately the prestige of their job is more vested in denying you access than giving you access.

I mean it's their job to give people access to information, and they certainly do, but the mark of a professional, in their eyes, is guarding information. It's much more embarrassing for them professionally to give too much access than too little.

LLM training gives them a "respectable" way of bypassing that and giving the world their information (which, in fairness, they probably all really want to do if they could).

fransje26 today at 11:11 AM

> Why go to the expense of training your own model, especially when it will be inferior to state-of-the-art models?

Uuh.. No? Especially if the training data, as in this case, is of better quality.

_cs2017_ today at 12:26 AM

> Why go to the expense...

Answer: idiocy of decision makers and the desire to get resources by those who created the proposal.

I assumed Scandinavia has better decision processes but apparently I was wrong.