
adgjlsfhk1 • today at 12:41 AM • 0 replies

Distilling from a larger model is probably not only cheaper than training from raw data, it's also likely higher quality. There's pretty strong support for the proposition that NNs learn a smoothed and regularized version of their training data, so a trained NN is likely higher quality than most of the data it was trained on.
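
To make the "smoothed" point concrete, here is a rough sketch of the usual soft-target distillation loss: the student is trained to match the teacher's temperature-softened output distribution rather than a one-hot label. The logits, the temperature value, and the function names are all made up for illustration, not taken from any particular model or library.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a smoother distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a single 3-class example (assumed values).
teacher = [4.0, 1.0, 0.5]
hard_label = [1.0, 0.0, 0.0]                     # what a raw one-hot label says
soft_target = softmax(teacher, temperature=2.0)  # what the teacher says

print(soft_target)
print(distillation_loss(teacher, [0.0, 0.0, 0.0]))
```

The soft target still ranks the first class highest but keeps nonzero mass on the other two, so the student gets more signal per example than it would from the hard label alone.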
