Hacker News

omneity · 05/03/2025 · 6 replies

Related: I built a translation app[0]* for language pairs that are not traditionally supported by Google Translate or DeepL (Moroccan Arabic paired with a dozen other major languages). I also trained a custom translation model for it, a BART encoder/decoder derivative, using data I collected, curated and corrected from scratch. I then built a continuous training pipeline for it that takes people's corrections into account.

Happy to answer questions if anyone is interested in building translation models for low-resource languages, without being a GPT wrapper. Great resources for this are Marian-NMT[1] and the Opus & Tatoeba projects (beware of data quality).

0: https://tarjamli.ma

* Unfortunately not functioning right now due to inference costs for the model, but I plan to launch it sometime soon.

1: https://marian-nmt.github.io


Replies

yorwba · 05/03/2025

I'm curious how large your training corpus is and your process for dealing with data quality issues. Did you proofread everything manually or were you able to automate some parts?
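To make the "automate some parts" option concrete, here is a minimal sketch of the kind of cheap heuristic filters often applied to scraped sentence pairs before any manual proofreading. It is not a description of the process omneity used; the function name `clean_pairs`, the `max_ratio` threshold and the sample data are all assumptions made for illustration.

```python
def clean_pairs(pairs, max_ratio=3.0):
    """Drop obviously bad (source, target) sentence pairs.

    Hypothetical helper: the rules and the max_ratio default are
    illustrative assumptions, not a known recipe from the thread.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # likely an untranslated copy
        # Crude character-length ratio check for misaligned pairs.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue  # exact duplicate, ignoring case
        seen.add(key)
        kept.append((src, tgt))
    return kept


sample = [
    ("Hello", "Bonjour"),
    ("Hello", "Bonjour"),  # duplicate
    ("Yes", ""),           # empty target
    ("OK", "OK"),          # copied, not translated
    ("Hi", "Bonjour tout le monde, comment allez-vous ?"),  # misaligned
]
cleaned = clean_pairs(sample)
```

Filters like these only catch mechanical problems. The copy check can also discard legitimate pairs such as proper names, and a character-length ratio is a rough proxy when the two languages use different scripts, so thresholds would need tuning per language pair.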

deivid · 05/03/2025

How big are the models that you use/built? Can't you run them in the browser?

Asking because I built a translator app[0] for Android, using marian-nmt (via bergamot), with Mozilla's models, and the performance for on-device inference is very good.

[0]: https://github.com/DavidVentura/firefox-translator

ks2048 · 05/03/2025

Any major challenges beyond gathering high-quality sentence pairs? Did the Marian training recipes basically work as-is? Any special processing needed for Arabic compared to Latin-script-based languages?

philomath868 · 05/03/2025

How does the "continuous training pipeline" work? You rebuild the model after every N corrections, with the corrections included in the data?
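The "rebuild after every N corrections" reading of the question can be sketched as follows. This is only an illustration of that one interpretation, not the pipeline omneity actually built; `CorrectionBuffer`, `threshold` and the `retrain` callback are hypothetical names, and the callback here just records the corpus size instead of training anything.

```python
from dataclasses import dataclass, field
from typing import Callable

Pair = tuple[str, str]  # (source sentence, corrected translation)


@dataclass
class CorrectionBuffer:
    """Collect user corrections and trigger a retrain every `threshold` items."""

    retrain: Callable[[list[Pair]], None]  # hypothetical training hook
    threshold: int = 100
    corpus: list[Pair] = field(default_factory=list)
    pending: list[Pair] = field(default_factory=list)

    def add(self, source: str, corrected: str) -> bool:
        """Store one correction; return True if a retrain was triggered."""
        self.pending.append((source, corrected))
        if len(self.pending) < self.threshold:
            return False
        # Fold the pending corrections into the corpus, then retrain on all of it.
        self.corpus.extend(self.pending)
        self.pending.clear()
        self.retrain(list(self.corpus))
        return True


runs = []  # corpus size seen by each (fake) training run
buf = CorrectionBuffer(retrain=lambda data: runs.append(len(data)), threshold=2)
buf.add("src 1", "fix 1")
buf.add("src 2", "fix 2")  # second correction reaches the threshold
buf.add("src 3", "fix 3")  # stays pending until the next batch
```

A real pipeline would presumably also review corrections before trusting them and evaluate each new model before deploying it; both steps are left out of this sketch.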

woodson · 05/04/2025

Not sure if you tried that already, but CTranslate2 can run BART and MarianNMT models quite efficiently, even without GPUs.

WalterBright · 05/04/2025

> for language pairs that are not traditionally supported

Maybe translate X to English, and then to Y?
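A minimal sketch of this pivot idea: compose an X-to-English translator with an English-to-Y translator. The word-lookup "models" below are toy stand-ins invented for illustration (`pivot`, `lookup`, `FR_EN` and `EN_DE` are all hypothetical), chosen to show one known weakness of pivoting: distinctions that English does not mark are lost on the way through.

```python
from typing import Callable

Translator = Callable[[str], str]


def pivot(src_to_en: Translator, en_to_tgt: Translator) -> Translator:
    """Compose two translators into one that goes through English."""
    def translate(text: str) -> str:
        return en_to_tgt(src_to_en(text))
    return translate


def lookup(table: dict[str, str]) -> Translator:
    """Toy word-by-word translator standing in for a real model."""
    return lambda text: " ".join(table.get(word, word) for word in text.split())


# French marks formality (tu / vous); English collapses both to "you".
FR_EN = {"tu": "you", "vous": "you"}
# Going from English, the toy table has to pick one German form.
EN_DE = {"you": "du"}

fr_to_de = pivot(lookup(FR_EN), lookup(EN_DE))
informal = fr_to_de("tu")
formal = fr_to_de("vous")  # formality is lost in the English step
```

Both inputs come out as "du", because the English intermediate no longer carries the information needed to choose between "du" and "Sie". Errors from the first step also carry into the second, which is one reason a direct model can be preferable when enough parallel data exists.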
