Hacker News

omneity · yesterday at 11:06 PM

Yes. There's a scoring and filtering pipeline first, where I automatically check the quality of each correction using a custom multilingual embedding model, madmon[0], and a language identification model, gherbal[1]. Above a certain similarity threshold the correction goes into the training dataset; below it, it's flagged for human review. This is mostly to stave off trolls or blatant mistakes.
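The similarity gate could look something like the sketch below. The threshold value and function names are hypothetical (in the real pipeline the vectors would come from madmon, and gherbal would additionally confirm the language), but the routing logic is the same: high-similarity corrections go straight to training, the rest to review.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # hypothetical value, tuned per model


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def route_correction(source_vec, correction_vec, threshold=SIMILARITY_THRESHOLD):
    """Route a correction into the training set or the human-review queue."""
    score = cosine_similarity(source_vec, correction_vec)
    return ("train", score) if score >= threshold else ("review", score)
```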

For the continuous training itself, yes I simply continue training the model from the last checkpoint (cosine lr scheduler). I am considering doing a full retraining at some point when I collect enough data to compare with this progressive training.
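Resuming from the last checkpoint with a cosine LR scheduler can be sketched like this in PyTorch. The model, optimizer, and `T_max` here are placeholders, not the actual setup; the point is that the scheduler state is saved and restored alongside the weights, so the schedule continues rather than restarting.

```python
import io
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Tiny stand-in model; in the real pipeline this would be the translation model.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=1000)  # cosine LR schedule

# Checkpoint model, optimizer, and scheduler state together, so a later
# run picks up the learning-rate curve where it left off.
buf = io.BytesIO()  # stands in for a checkpoint file on disk
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}, buf)

# Later round: resume from the last checkpoint with newly collected data.
buf.seek(0)
ckpt = torch.load(buf)
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])
```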

Apologies for the sparse links; it takes a lot of time to work on this, let alone fully document everything.

0: https://api.sawalni.com/docs#tag/Embeddings

1: https://api.sawalni.com/docs#tag/Language-Identification