Hacker News

omneity yesterday at 8:16 PM (6 replies)

Related: I built a translation app[0]* for language pairs that are not traditionally supported by Google Translate or DeepL (Moroccan Arabic paired with a dozen other major languages). I also trained a custom translation model for it, a BART encoder/decoder derivative, using data I collected, curated and corrected from scratch, and then built a continuous training pipeline for it that takes people's corrections into account.

Happy to answer questions if anyone is interested in building translation models for low-resource languages, without being a GPT wrapper. A great resource for this is Marian-NMT[1] and the Opus & Tatoeba projects (beware of data quality).

0: https://tarjamli.ma

* Unfortunately not functioning right now due to inference costs for the model, but I plan to launch it sometime soon.

1: https://marian-nmt.github.io


Replies

yorwba yesterday at 8:56 PM

I'm curious how large your training corpus is and your process for dealing with data quality issues. Did you proofread everything manually or were you able to automate some parts?

deivid yesterday at 10:27 PM

How big are the models that you use/built? Can't you run them in the browser?

Asking because I built a translator app[0] for Android, using marian-nmt (via bergamot), with Mozilla's models, and the performance for on-device inference is very good.

[0]: https://github.com/DavidVentura/firefox-translator

woodson today at 3:25 AM

Not sure if you've tried it already, but CTranslate2 can run BART and MarianNMT models quite efficiently, even without GPUs.

ks2048 yesterday at 9:10 PM

Any major challenges beyond gathering high-quality sentence pairs? Did the Marian training recipes basically work as-is? Any special processing needed for Arabic compared to Latin-script-based languages?

philomath868 yesterday at 10:46 PM

How does the "continuous training pipeline" work? Do you rebuild the model after every N corrections, with the corrections included in the data?
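
To make the question concrete, the "rebuild after every N corrections" scheme could look roughly like the sketch below. It only illustrates the idea being asked about and is not a description of how the app actually works: `CorrectionBuffer`, the `retrain` callback, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, corrected translation)


@dataclass
class CorrectionBuffer:
    """Hypothetical helper: collects user corrections, retrains every N of them."""

    retrain: Callable[[List[Pair]], None]  # assumed training entry point
    threshold: int = 100                   # N, an arbitrary illustrative value
    corpus: List[Pair] = field(default_factory=list)
    pending: int = 0                       # corrections since the last retrain

    def add(self, source: str, corrected: str) -> bool:
        """Store one correction; return True if it triggered a retrain."""
        self.corpus.append((source, corrected))
        self.pending += 1
        if self.pending >= self.threshold:
            # Retrain on the full corpus, with the new corrections included.
            self.retrain(list(self.corpus))
            self.pending = 0
            return True
        return False
```

A real pipeline would presumably also validate corrections before trusting them and evaluate the new model before deploying it; none of that is shown here.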

WalterBright today at 2:34 AM

> for language pairs that are not traditionally supported

Maybe translate X to English, and then to Y?
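
A minimal sketch of that pivot idea, assuming each hop is just a function from string to string. The `pivot` helper and the toy dictionary "models" are made up for illustration; real translators would replace them.

```python
from typing import Callable

Translator = Callable[[str], str]


def pivot(src_to_en: Translator, en_to_tgt: Translator) -> Translator:
    """Compose two translators, using English as the intermediate language."""
    def translate(text: str) -> str:
        return en_to_tgt(src_to_en(text))
    return translate


# Toy stand-ins for real models: plain dictionary lookups (illustration only).
def fr_to_en(text: str) -> str:
    return {"bonjour": "hello"}.get(text, text)


def en_to_es(text: str) -> str:
    return {"hello": "hola"}.get(text, text)


fr_to_es = pivot(fr_to_en, en_to_es)
```

One known trade-off of this approach is that any mistake made in the first hop is carried into the second, since the second model only ever sees the English output.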
