lambda • yesterday at 5:58 PM • 2 replies

Distillation isn't only between different labs.

A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.

I don't know enough to say whether that has any benefit over just training the smaller model directly, but I'll bet there are times when it's useful. I could easily see it being easier to do the initial pre-training on a larger model and then distill everything useful down into a smaller one, essentially filtering out a lot of noise in the process.
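
For anyone unfamiliar with what "distill" means mechanically, here is a minimal sketch of one common formulation: the student is trained to match the teacher's temperature-softened output distribution by minimizing a KL divergence. This is an illustration, not any particular lab's recipe. The function names, the toy logits, and the temperature of 2.0 are all assumptions made up for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper for illustration; the T^2 factor is the usual
    convention for keeping gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy logits over three classes (made-up numbers).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # ranks the classes in the opposite order

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close < loss_far)
```

In a real setup this loss would be computed per token over the teacher's and student's output logits and backpropagated through the student only; the sketch just shows that a student whose distribution is closer to the teacher's gets a lower loss.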

Replies

bandrami • today at 6:15 AM

I think the idea is you sink the pretraining costs once and then you can distill multiple specialized models from that

spwa4 • yesterday at 6:32 PM

There used to be training methods like that, but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly, that's actually cheaper.
