txhwind, today at 2:52 AM

Could you share some recent articles or papers comparing the two methods, especially for language modeling? I was not convinced by this claim when I read the original knowledge distillation paper. ChatGPT said some later works show that (1) the gain may come from label smoothing, and (2) soft logits are more meaningful when the student is much smaller than the teacher.