Hacker News

omneity, yesterday at 10:16 PM

I started seeing results with as few as 5-10k pairs, but you want something closer to 100k, especially if the language has a lot of variation (e.g. it's morphologically rich, agglutinative, or written in a non-standardized way).

Manual proof-reading (and data generation) was a big part of it; it's definitely not a glamorous, magic process. But as I went through it I started noticing patterns and wrote some tools to help.
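
As an illustration only (these are not the tools mentioned above), here is a minimal Python sketch of the kind of helper that can narrow down which sentence pairs to proof-read first. The function name `flag_pairs`, the `Flag` record, and the length-ratio threshold of 3.0 are all assumptions made for this example. A sensible threshold depends heavily on the language pair, and for agglutinative languages it likely needs tuning.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    index: int   # position of the pair in the input list
    reason: str  # why the pair was flagged for manual review


def flag_pairs(pairs, max_len_ratio=3.0):
    """Flag (source, target) pairs that look worth a manual look.

    Hypothetical heuristics, chosen for illustration:
    - one side is empty
    - source and target are identical (likely untranslated)
    - the exact pair was already seen (duplicate)
    - character lengths differ by more than max_len_ratio
    """
    flags = []
    seen = set()
    for i, (src, tgt) in enumerate(pairs):
        src_n, tgt_n = src.strip(), tgt.strip()
        if not src_n or not tgt_n:
            flags.append(Flag(i, "empty side"))
            continue
        key = (src_n.lower(), tgt_n.lower())
        if key[0] == key[1]:
            flags.append(Flag(i, "source equals target"))
        if key in seen:
            flags.append(Flag(i, "duplicate pair"))
        seen.add(key)
        ratio = max(len(src_n), len(tgt_n)) / min(len(src_n), len(tgt_n))
        if ratio > max_len_ratio:
            flags.append(Flag(i, "length ratio"))
    return flags


if __name__ == "__main__":
    sample = [
        ("Hello", "Bonjour"),
        ("", "x"),
        ("Hi", "Hi"),
        ("Hello", "Bonjour"),
        ("Yes", "Oui, bien sûr, absolument"),
    ]
    for flag in flag_pairs(sample):
        print(flag.index, flag.reason)
```

Checks like these only prioritize the review queue. They don't replace reading the pairs, and a flagged pair can still be a perfectly good translation.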

There's a way to leverage LLMs to help with this if your language is supported (my target wasn't at the time), but I still strongly recommend a manual review step. That's really the secret sauce, and there's no way around it if you're serious about the translation quality of your model.