
teleforce · today at 12:00 AM

It seems that self-distillation is the way to go for LLMs.

Self-distillation was shown to be very efficient and effective back in January this year by a team from MIT and ETH with their Self-Distillation Fine-Tuning (SDFT) system for LLMs [1], [2].
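For anyone unfamiliar with the technique, here is a minimal sketch of the generic self-distillation idea (illustrative only, not the SDFT authors' actual code; it assumes a HuggingFace-style causal LM whose forward pass returns .logits): a frozen snapshot of the model acts as its own teacher, and the trainable copy is fine-tuned to match the teacher's output distribution.

    import copy
    import torch
    import torch.nn.functional as F

    def make_teacher(model):
        # Snapshot the current model as a frozen teacher of itself.
        teacher = copy.deepcopy(model).eval()
        for p in teacher.parameters():
            p.requires_grad_(False)
        return teacher

    def self_distillation_loss(model, teacher, input_ids, temperature=2.0):
        # Teacher provides soft targets; no gradients flow through it.
        with torch.no_grad():
            teacher_logits = teacher(input_ids).logits
        student_logits = model(input_ids).logits
        # Standard distillation loss: KL divergence between the
        # temperature-softened student and teacher distributions.
        loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        )
        return loss * temperature**2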

That earlier SDFT work is also the closest competitor in this paper's comparison table, listed under the name On-Policy Self-Distillation.

I hope they keep the original work's real name, Self-Distillation Fine-Tuning (SDFT). Imagine a later paper citing this very paper as "cross-entropy self-distillation" instead of its own given name, Simple Self-Distillation (SSD). Although I'll admit it's a lousy name that collides with the common SSD nomenclature for solid-state drives, as others have rightly pointed out.

I think they should have given proper credit to this earlier seminal work on SDFT, but apparently they just list it as one of the systems in their benchmark without explaining much of the connection and lineage, which is a big deal in research publication.

[1] Self-Distillation Enables Continual Learning: https://arxiv.org/abs/2601.19897

[2] Self-Distillation Enables Continual Learning (project page): https://self-distillation.github.io/SDFT.html