Related: I built a translation app[0]* for language pairs that are not traditionally supported by Google Translate or DeepL (Moroccan Arabic paired with a dozen other major languages), and trained a custom translation model for it (a BART encoder/decoder derivative) on data I collected, curated, and corrected from scratch. I then built a continuous training pipeline for it that takes people's corrections into account.
Happy to answer questions if anyone is interested in building translation models for low-resource languages without just being a GPT wrapper. Great resources for this are Marian-NMT[1] and the Opus & Tatoeba projects (beware of data quality).
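On the data-quality point: a minimal filtering pass over the raw sentence pairs goes a long way. Something along these lines (a rough sketch; the thresholds are illustrative, not exactly what I used):

    # Minimal parallel-corpus filtering sketch; thresholds are illustrative.
    # Reads tab-separated "source<TAB>target" lines (e.g. OPUS/Tatoeba exports) from stdin.
    import sys

    def keep(src: str, tgt: str) -> bool:
        if not src or not tgt or src == tgt:        # empty lines, untranslated copies
            return False
        ls, lt = len(src.split()), len(tgt.split())
        if ls > 100 or lt > 100:                    # overly long segments are often misaligned
            return False
        return max(ls, lt) / min(ls, lt) <= 3.0     # so are wildly mismatched lengths

    seen = set()
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue
        src, tgt = parts[0].strip(), parts[1].strip()
        if (src, tgt) not in seen and keep(src, tgt):   # deduplicate exact pairs
            seen.add((src, tgt))
            print(f"{src}\t{tgt}")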
* Unfortunately it's not running right now because of the model's inference costs, but I plan to relaunch it sometime soon.
How big are the models you built/use? Can't you run them in the browser?
Asking because I built a translator app[0] for Android using marian-nmt (via bergamot) with Mozilla's models, and on-device inference performance is very good.
Not sure if you've tried it already, but ctranslate2 can run BART and MarianNMT models quite efficiently, even without GPUs.
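For example, converting a standard OPUS-MT checkpoint (just a placeholder here, not your model) and running it on CPU looks roughly like this:

    # Convert once (CLI ships with ctranslate2; also needs transformers installed):
    #   ct2-transformers-converter --model Helsinki-NLP/opus-mt-en-ar --output_dir opus-mt-en-ar-ct2
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator("opus-mt-en-ar-ct2", device="cpu")
    tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ar")

    # Tokenize the source, translate, then detokenize the best hypothesis.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode("How are you today?"))
    result = translator.translate_batch([source])
    target = result[0].hypotheses[0]
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target), skip_special_tokens=True))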
Any major challenges beyond gathering high-quality sentence pairs? Did the Marian training recipes basically work as-is? Any special processing needed for Arabic compared to Latin-script-based languages?
How does the "continuous training pipeline" work? Do you rebuild the model after every N corrections, with the corrections included in the training data?
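I.e. something like this (a pure guess at the mechanism; the paths, threshold, and fine_tune callable are all made up):

    # Guess at the mechanism: accumulate corrections, retrain once there are enough.
    import json
    from pathlib import Path

    N = 500                                          # retrain once this many corrections pile up
    CORRECTIONS = Path("corrections.jsonl")          # one {"src": ..., "tgt": ...} per line
    BASE_CORPUS = Path("base_corpus.jsonl")

    def load(path: Path) -> list[dict]:
        if not path.exists():
            return []
        return [json.loads(line) for line in path.read_text().splitlines() if line]

    def maybe_retrain(fine_tune) -> bool:
        # fine_tune: callable that takes a list of {"src", "tgt"} pairs and rebuilds the model
        corrections = load(CORRECTIONS)
        if len(corrections) < N:
            return False
        fine_tune(load(BASE_CORPUS) + corrections)   # corrections included alongside base data
        CORRECTIONS.rename("corrections.used.jsonl") # don't feed the same batch in twice
        return True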
> for language pairs that are not traditionally supported
Maybe translate X to English, and then to Y?
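Roughly this, assuming both legs exist as off-the-shelf models (the OPUS-MT checkpoints below are just examples):

    # Pivoting through English with two off-the-shelf models; the checkpoints are examples only.
    # Requires transformers plus sentencepiece/sacremoses for the Marian tokenizers.
    from transformers import pipeline

    x_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ar-en")   # X -> English
    en_to_y = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")   # English -> Y

    def pivot(text: str) -> str:
        english = x_to_en(text)[0]["translation_text"]
        return en_to_y(english)[0]["translation_text"]

    print(pivot("مرحبا بالعالم"))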
I'm curious how large your training corpus is and your process for dealing with data quality issues. Did you proofread everything manually or were you able to automate some parts?