Hacker News

yorwba · yesterday at 8:56 PM

I'm curious how large your training corpus is and what your process is for dealing with data quality issues. Did you proofread everything manually, or were you able to automate some parts?


Replies

omneity · yesterday at 10:16 PM

I started seeing results as early as 5-10k pairs, but you want something closer to 100k, especially if the language has a lot of variation (i.e. it's morphologically rich, agglutinative, or written in a non-standardized way).
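
For a rough sense of how much variation you're dealing with, a type-token ratio over the target side of the corpus is a quick proxy (a generic sketch, not the commenter's method; the filename is a placeholder):

```python
# Rough gauge of morphological richness: if the vocabulary keeps growing
# with every new sentence (high type-token ratio), the model will generally
# need more training pairs to cover the variation.
from collections import Counter

def type_token_ratio(path):
    """Ratio of unique tokens to total tokens; higher suggests richer morphology."""
    counts = Counter()
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            counts.update(tokens)
            total += len(tokens)
    return len(counts) / total if total else 0.0

# "target_side.txt" is a placeholder for one side of your parallel corpus.
print(type_token_ratio("target_side.txt"))
```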

Manual proofreading (and data generation) was a big part of it; it's definitely not a glamorous, magical process. But as I went through it I noticed patterns and wrote some tools to help.
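
For illustration, the usual first pass over a parallel corpus is a handful of mechanical filters like these (a sketch of common heuristics, not necessarily the tools used here; it assumes tab-separated source/target pairs in a hypothetical pairs.tsv):

```python
# Common parallel-corpus sanity filters: length-ratio check, copied-through
# source, mismatched numbers, and exact duplicates. Anything that survives
# still goes to manual review; this only removes the obvious junk.
import re

def digits(s):
    return sorted(re.findall(r"\d+", s))

def clean_pairs(path, max_ratio=3.0):
    seen = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue  # malformed row
            src, tgt = (p.strip() for p in parts)
            if not src or not tgt:
                continue  # empty side
            if src == tgt:
                continue  # target is just a copy of the source
            if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_ratio:
                continue  # suspicious length mismatch
            if digits(src) != digits(tgt):
                continue  # numbers disagree between the two sides
            if (src, tgt) in seen:
                continue  # exact duplicate pair
            seen.add((src, tgt))
            yield src, tgt

for src, tgt in clean_pairs("pairs.tsv"):
    print(src, tgt, sep="\t")
```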

There's a way to leverage LLMs to help with this if your language is supported (my target wasn't at the time), but I still strongly recommend a manual review pass. That's really the secret sauce, and there's no way around it if you're serious about the translation quality of your model.
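
As one possible shape for the LLM-assisted route: have the model only flag suspect pairs for the manual-review queue rather than judge them finally (a sketch using the OpenAI Python client; the model name and prompt are placeholders, and it only works if the LLM actually covers your language):

```python
# LLM as a first-pass reviewer: it flags suspect pairs for human review,
# it does not replace the review itself.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are checking a translation pair for a machine-translation corpus.\n"
    "Source: {src}\nTranslation: {tgt}\n"
    "Answer OK if the translation is faithful and fluent, otherwise FLAG."
)

def needs_review(src, tgt):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; pick a model that covers your language
        messages=[{"role": "user", "content": PROMPT.format(src=src, tgt=tgt)}],
    )
    return "FLAG" in (resp.choices[0].message.content or "").upper()

# Pairs for which needs_review() returns True go into the manual-review queue.
```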