The bitter-er lesson is that distillation (training a small model to imitate a bigger model's outputs) works pretty damn well. It's great news for the GPU poor, not so great for the labs training the frontier models everyone distills from.
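To make "distillation" concrete, here's a minimal sketch of the classic soft-label recipe (Hinton et al., 2015) in PyTorch: the student is trained to match the teacher's temperature-softened output distribution. The tiny models, the temperature, and the loss weighting here are all illustrative assumptions, not anything from this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins: any frozen "big" teacher and small student
# with the same number of output classes would do.
teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

T = 2.0      # temperature: softens both distributions (assumed value)
alpha = 0.5  # mix between distillation loss and hard-label loss (assumed)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(32, 128)         # a batch of inputs (random for the sketch)
y = torch.randint(0, 10, (32,))  # ground-truth labels

with torch.no_grad():  # teacher is frozen; we only read its logits
    t_logits = teacher(x)

s_logits = student(x)

# KL divergence between temperature-softened teacher and student
# distributions. The T**2 factor keeps gradient magnitudes roughly
# comparable across temperatures.
distill = F.kl_div(
    F.log_softmax(s_logits / T, dim=-1),
    F.softmax(t_logits / T, dim=-1),
    reduction="batchmean",
) * (T ** 2)

hard = F.cross_entropy(s_logits, y)  # optional hard-label term
loss = alpha * distill + (1 - alpha) * hard

opt.zero_grad()
loss.backward()
opt.step()
```

For LLMs the same idea applies token by token over the vocabulary; the even cheaper variant, which is probably what most of the GPU poor actually do, is plain fine-tuning on text sampled from the teacher (sequence-level distillation).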