A reason to do student-teacher distillation is that soft target logits in general are a richer mediu...

ACCount37 • yesterday at 10:15 PM • 2 replies • view on HN

A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets. More steering signal per teacher token. And running ultra large 10T tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text only synthetics.

Replies

txhwind • today at 2:52 AM

Could you share some latest articles or papers comparing both methods, especially on lanuage modelling case? I was not conviced by this claim when reading the original Knowledge Distillation paper. ChatGPT said there are some later works showing: 1. the gain may come from label smoothing; 2. soft logits are more meaningful for students much smaller than teacher.

rao-v • yesterday at 11:10 PM

I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for it’s best output than to waste time running it on arbitary output just to train the student.

Though you could argue that perhaps labs just save the per token distribution and use that during fine tuning … which starts looking more like student teacher fine tuning if not classic distillation from random weights

➕ show 1 reply

alt Hacker News

Replies