> Any major challenges beyond gathering high-quality sentence pairs? Did the Marian training recipes basically work as-is? Any special processing needed for Arabic compared to Latin-script-based languages?
Marian was a good starting point and let me iterate quickly when I first started, but I soon found it a bit limiting, since it performs best on single language pairs.
My goal was a Google Translate-style multilingual translation model, and for that the BART architecture ultimately proved better because you benefit from cross-language transfer learning. If your model learns the meaning of "car" in language pair (A, B), and it also knows it in pair (B, C), then it will perform decently when you ask it to translate between A and C. The effect compounds quickly as you add language pairs.
One big limitation of BART (and where LLMs become more attractive) is that it becomes extremely slow on longer sentences, and it is worse at understanding and translating complex sentences.
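To make the cross-language transfer concrete, here is a minimal sketch of the multilingual setup using the public mBART-50 checkpoint from Hugging Face (not my model; the checkpoint name and language codes are just illustrative). The key idea is a single shared encoder-decoder where the target language is selected with a language token, so directions that were never paired directly in training can still be decoded.

```python
# Minimal sketch: one multilingual BART-style model, target language chosen
# via a language token. Uses the public mBART-50 checkpoint as a stand-in.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"  # illustrative checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    tokenizer.src_lang = src_lang                      # e.g. "en_XX"
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],  # pick the target language
        max_length=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# The same shared latent space serves every direction, which is where the
# cross-lingual transfer described above comes from.
print(translate("The car is parked outside.", "en_XX", "fr_XX"))
print(translate("The car is parked outside.", "en_XX", "ar_AR"))
```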
> Any special processing needed for Arabic compared to Latin-script-based languages?
Yes indeed, quite a lot. Especially for Moroccan Arabic, which is written in both Arabic and Latin scripts (I made sure to support both, and they're aligned in the model's latent space). For this I developed semantic and phonetic embedding models along the way that helped a lot. I am in the process of publishing a paper on the phonetic processing aspect; if you're interested, let's stay in touch and I'll let you know when it's out.
But beyond the pre-processing and data pipeline, the model itself didn't need any special treatment besides the tokenizer.
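On the tokenizer point, here is a hedged sketch of what a shared subword vocabulary covering both scripts could look like, using SentencePiece (the file names, vocabulary size, and coverage value are assumptions, not my exact recipe). High character coverage matters so that Arabic-script and Arabizi spellings of the same Darija words both get reasonable segmentations.

```python
import sentencepiece as spm

# Train one shared subword vocabulary over Arabic-script and Latin-script
# (Arabizi) text together, so both writings of Darija use the same model.
# File names, vocab size, and coverage are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="darija_arabic_script.txt,darija_latin_script.txt,other_languages.txt",
    model_prefix="translation_spm",
    vocab_size=32000,
    character_coverage=0.9999,   # keep rare Arabic characters and diacritics
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="translation_spm.model")
print(sp.encode("wach kayn chi tomobil hna?", out_type=str))   # Latin-script Darija
print(sp.encode("واش كاين شي طوموبيل هنا؟", out_type=str))       # Arabic-script Darija
```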