Related: I built a translation app[0]* for language pairs that are not traditionally supported by Google Translate or DeepL (Moroccan Arabic paired with a dozen other major languages), and trained a custom translation model for it (a BART encoder/decoder derivative) on data I collected, curated, and corrected from scratch. I then built a continuous training pipeline for it that takes people's corrections into account.
Happy to answer questions if anyone is interested in building translation models for low-resource languages without just being a GPT wrapper. Great resources for this are Marian-NMT[1] and the Opus & Tatoeba projects (beware of data quality).
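On the data-quality point: a minimal filtering pass over the raw sentence pairs goes a long way. Something along these lines (a rough sketch; the thresholds are illustrative, not exactly what I used):

    # Minimal parallel-corpus filtering sketch; thresholds are illustrative.
    # Reads tab-separated "source<TAB>target" lines (e.g. OPUS/Tatoeba exports) from stdin.
    import sys

    def keep(src: str, tgt: str) -> bool:
        if not src or not tgt or src == tgt:        # empty lines, untranslated copies
            return False
        ls, lt = len(src.split()), len(tgt.split())
        if ls > 100 or lt > 100:                    # overly long segments are often misaligned
            return False
        return max(ls, lt) / min(ls, lt) <= 3.0     # so are wildly mismatched lengths

    seen = set()
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue
        src, tgt = parts[0].strip(), parts[1].strip()
        if (src, tgt) not in seen and keep(src, tgt):   # deduplicate exact pairs
            seen.add((src, tgt))
            print(f"{src}\t{tgt}")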
* Unfortunately it's not running right now because of the model's inference costs, but I plan to relaunch it sometime soon.
How big are the models you built/use? Can't you run them in the browser?
Asking because I built a translator app[0] for Android using marian-nmt (via bergamot) with Mozilla's models, and on-device inference performance is very good.
Not sure if you've tried it already, but ctranslate2 can run BART and MarianNMT models quite efficiently, even without GPUs.
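For example, converting a standard OPUS-MT checkpoint (just a placeholder here, not your model) and running it on CPU looks roughly like this:

    # Convert once (CLI ships with ctranslate2; also needs transformers installed):
    #   ct2-transformers-converter --model Helsinki-NLP/opus-mt-en-ar --output_dir opus-mt-en-ar-ct2
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator("opus-mt-en-ar-ct2", device="cpu")
    tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ar")

    # Tokenize the source, translate, then detokenize the best hypothesis.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode("How are you today?"))
    result = translator.translate_batch([source])
    target = result[0].hypotheses[0]
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target), skip_special_tokens=True))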
Any major challenges beyond gathering high-quality sentence pairs? Did the Marian training recipes basically work as-is? Any special processing needed for Arabic compared to Latin-script-based languages?
How does the "continuous training pipeline" work? Do you rebuild the model after every N corrections, with the corrections included in the training data?
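I.e. something like this (a pure guess at the mechanism; the paths, threshold, and fine_tune callable are all made up):

    # Guess at the mechanism: accumulate corrections, retrain once there are enough.
    import json
    from pathlib import Path

    N = 500                                          # retrain once this many corrections pile up
    CORRECTIONS = Path("corrections.jsonl")          # one {"src": ..., "tgt": ...} per line
    BASE_CORPUS = Path("base_corpus.jsonl")

    def load(path: Path) -> list[dict]:
        if not path.exists():
            return []
        return [json.loads(line) for line in path.read_text().splitlines() if line]

    def maybe_retrain(fine_tune) -> bool:
        # fine_tune: callable that takes a list of {"src", "tgt"} pairs and rebuilds the model
        corrections = load(CORRECTIONS)
        if len(corrections) < N:
            return False
        fine_tune(load(BASE_CORPUS) + corrections)   # corrections included alongside base data
        CORRECTIONS.rename("corrections.used.jsonl") # don't feed the same batch in twice
        return True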
> for language pairs that are not traditionally supported
Maybe translate X to English, and then to Y?
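Roughly this, assuming both legs exist as off-the-shelf models (the OPUS-MT checkpoints below are just examples):

    # Pivoting through English with two off-the-shelf models; the checkpoints are examples only.
    # Requires transformers plus sentencepiece/sacremoses for the Marian tokenizers.
    from transformers import pipeline

    x_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ar-en")   # X -> English
    en_to_y = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")   # English -> Y

    def pivot(text: str) -> str:
        english = x_to_en(text)[0]["translation_text"]
        return en_to_y(english)[0]["translation_text"]

    print(pivot("مرحبا بالعالم"))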
I'm curious how large your training corpus is and your process for dealing with data quality issues. Did you proofread everything manually or were you able to automate some parts?