Hacker News

lambda yesterday at 5:58 PM

Distillation isn't only between different labs.

A lab can train a large model, and then distill a smaller model from it that retains most of the useful capability.

I don't know enough to say whether that has any benefit over just training the smaller model directly, but I'd bet there are times when it's useful. I could easily see it being easier to do the initial pre-training on a larger model and then distill everything useful down into a smaller one, essentially filtering out a lot of noise in the process.
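
For anyone who hasn't seen what "distill" means mechanically, here is a minimal sketch of one common formulation, the soft-label loss from Hinton et al.'s knowledge distillation paper. I'm assuming that formulation for illustration; it is not necessarily what any particular lab does. The function names, the logits, and the temperature of 2.0 are all made up for the example. The idea is that the student is trained to match the teacher's softened output distribution rather than only the hard labels.

```python
import math

def log_softmax(logits, temperature=1.0):
    # Divide by the temperature first; higher values flatten the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) over the temperature-softened distributions.
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl

# Hypothetical logits for a single 3-class example.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The loss is zero when the student's logits match the teacher's and grows as the two distributions diverge. In real training it would be averaged over a batch and usually mixed with the ordinary cross-entropy loss on the true labels.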


Replies

bandrami today at 6:15 AM

I think the idea is that you sink the pretraining costs once and can then distill multiple specialized models from that one base.

spwa4 yesterday at 6:32 PM

There used to be training methods like that, but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly, that's actually cheaper.