Hacker News

loire280 · last Wednesday at 9:07 PM

They don't claim to support Polish, but they do support Russian.

> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.

I wonder how much having languages with the same roots (e.g. the Romance languages in the list above, or multiple Slavic languages) affects the parameter count and the training set. Do you need more training data to differentiate between multiple similar languages? How would swapping, for example, Hindi (fairly distinct from the other 12 supported languages) for Ukrainian and Polish (both of which share roots with Russian) affect the parameter count?


Replies

MarcelOlsz · last Wednesday at 9:38 PM

Nobody ever supports Polish. It's the worst. They'll support like, Swahili, but not Polish.

edit: I stand corrected lol. I'll go with "Gaelic" instead.

_ache_ · yesterday at 12:52 AM

Just a side note to remember that this is a mini model. It's very small and yet supports 13 languages.

I guess a European version could be created, but for now it's aimed at worldwide distribution.

sbinnee · yesterday at 8:11 AM

I guess I will check Korean. OpenAI's audio mini is not bad, but I always have to have GPT check and fix the transcription.
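
For anyone curious, here is a minimal sketch of that two-step workflow (transcribe with a small audio model, then have a text model correct the output) using the OpenAI Python SDK. The specific model names and the correction prompt are assumptions for illustration, not details from the comment.

```python
# Minimal sketch, assuming the OpenAI Python SDK and illustrative model choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def transcribe_and_fix(path: str, language: str = "ko") -> str:
    # Step 1: raw transcription with a small audio model (assumed model name).
    with open(path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="gpt-4o-mini-transcribe",
            file=audio_file,
            language=language,
        )

    # Step 2: have a text model check and fix obvious transcription errors
    # (spelling, spacing, homophones) without changing the meaning.
    fixed = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[
            {
                "role": "system",
                "content": (
                    "Fix transcription errors without changing the meaning. "
                    "Reply with the corrected text only."
                ),
            },
            {"role": "user", "content": transcript.text},
        ],
    )
    return fixed.choices[0].message.content


# Example usage: print(transcribe_and_fix("meeting.wav"))
```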