rao-v • yesterday at 7:27 PM • 6 replies

It’s really worth distinguishing between old-fashioned student-teacher distillation (i.e., at the level of layers, weights, and distributions) and large-scale synthetic dataset creation.

The latter is much better (since you can clean up, review, update responses and filter your datasets).

I suspect nobody is doing real student-teacher distillation. It’s just easier to do a bunch of training on the same giant corpus and then post-train on the synthetic corpus, with its reasoning traces etc., which might have been generated by a bigger, better LLM.
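
To make the "clean up, review, and filter" point concrete, here is a minimal sketch of a filtering pass over teacher-generated samples. The record layout (`prompt` and `response` keys), the function name `filter_synthetic`, and the `is_acceptable` check are all hypothetical, chosen only for illustration; real pipelines will differ.

```python
def filter_synthetic(samples, is_acceptable):
    """Keep non-empty, non-duplicate samples that pass a caller-supplied check.

    `samples` is a list of dicts with "prompt" and "response" keys
    (a hypothetical layout). `is_acceptable` is any callable that
    returns True for samples worth keeping, e.g. a verifier or a
    human-review flag.
    """
    seen = set()
    kept = []
    for sample in samples:
        response = sample["response"].strip()  # clean up whitespace
        key = (sample["prompt"], response)
        if not response or key in seen:        # drop empties and exact duplicates
            continue
        if not is_acceptable(sample):          # drop samples that fail review
            continue
        seen.add(key)
        kept.append({"prompt": sample["prompt"], "response": response})
    return kept
```

None of these steps has an obvious counterpart when the student is trained directly on teacher logits, which is the flexibility the comment is pointing at.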

Replies

teleforce • today at 1:48 AM

Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].

Given the release timelines, I suspect all the 4.x models after Opus 4 are probably fine-tuned with self-distillation. The latest paper, by Apple, focuses on code generation using a simple technique, hence the name simple self-distillation (SSD) [4],[5].

I've got a strong feeling that self-distillation is the second-best thing to happen to LLMs after the transformer breakthrough.

[1] Self-Distillation Enables Continual Learning [pdf] (25 comments):

https://news.ycombinator.com/item?id=48165265

[2] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[4] Embarrassingly simple self-distillation improves code generation (201 comments):

https://news.ycombinator.com/item?id=47637757

[5] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193

ACCount37 • yesterday at 10:15 PM

A reason to do student-teacher distillation is that soft target logits are, in general, a richer medium than text that tokenizes to hard targets: more steering signal per teacher token. And running ultra-large, 10T-tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce everything to text-only synthetics.
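
As a rough illustration of the soft-versus-hard distinction, the sketch below compares the two per-token losses in plain Python. The function names, the temperature default, and the `temperature ** 2` scaling (a common convention in the distillation literature) are assumptions for illustration, not anything the comment specifies.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_target_loss(student_logits, target_index):
    """Cross-entropy against a single sampled token (text-only synthetics)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the whole vocabulary (logit distillation)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2  # assumed scaling convention
```

With teacher logits such as `[4.0, 3.5, -2.0]`, the hard target only tells the student that token 0 was sampled, while the soft target also conveys that token 1 was nearly as plausible and token 2 was not. That extra information is the per-token signal the comment refers to.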

txhwind • today at 2:56 AM

I have preferred synthetic datasets since the day I first heard of distillation. The engineering friction is much lower than with soft logits, and I have not observed or heard of any performance loss (in the speech and language area).

DoctorOetker • today at 2:23 AM

One may view pre-training as distillation.

The teacher is a corpus of text, and predicting the "next token after the context" amounts to looking up the context in the corpus: for each occurrence, the label is whatever followed it in the corpus, weighted by one over the number of occurrences of that context. The teacher has nothing to say about contexts outside the corpus, though, unlike the usual teacher model in distillation.
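
Read literally, this view describes a lookup table of empirical next-token distributions. The sketch below builds one under a simplifying assumption the comment does not make: a fixed context length. The name `corpus_teacher` is hypothetical.

```python
from collections import Counter, defaultdict

def corpus_teacher(tokens, context_len):
    """Map each context seen in the corpus to its empirical next-token distribution.

    Each occurrence of a context contributes its follower with weight
    1 / (number of occurrences of that context).
    """
    counts = defaultdict(Counter)
    for i in range(context_len, len(tokens)):
        context = tuple(tokens[i - context_len:i])
        counts[context][tokens[i]] += 1
    teacher = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        teacher[context] = {tok: n / total for tok, n in followers.items()}
    return teacher
```

A context that never appears in the corpus is simply missing from the table, which matches the caveat that this "teacher" offers no signal outside the corpus.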

girvo • yesterday at 10:47 PM

> I suspect nobody is doing real student teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is, though.
