The bitter-er lesson is that distillation (training a small model to imitate a bigger model's outputs) works pretty damn well. It's great news for the GPU poor, not so great for the labs training the frontier models everyone distills from.
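To make "distillation" concrete, here's a minimal sketch of the classic soft-label recipe (Hinton et al., 2015) in PyTorch: the student is trained to match the teacher's temperature-softened output distribution. The tiny models, the temperature, and the loss weighting here are all illustrative assumptions, not anything from this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins: any frozen "big" teacher and small student
# with the same number of output classes would do.
teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

T = 2.0      # temperature: softens both distributions (assumed value)
alpha = 0.5  # mix between distillation loss and hard-label loss (assumed)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(32, 128)         # a batch of inputs (random for the sketch)
y = torch.randint(0, 10, (32,))  # ground-truth labels

with torch.no_grad():  # teacher is frozen; we only read its logits
    t_logits = teacher(x)

s_logits = student(x)

# KL divergence between temperature-softened teacher and student
# distributions. The T**2 factor keeps gradient magnitudes roughly
# comparable across temperatures.
distill = F.kl_div(
    F.log_softmax(s_logits / T, dim=-1),
    F.softmax(t_logits / T, dim=-1),
    reduction="batchmean",
) * (T ** 2)

hard = F.cross_entropy(s_logits, y)  # optional hard-label term
loss = alpha * distill + (1 - alpha) * hard

opt.zero_grad()
loss.backward()
opt.step()
```

For LLMs the same idea applies token by token over the vocabulary; the even cheaper variant, which is probably what most of the GPU poor actually do, is plain fine-tuning on text sampled from the teacher (sequence-level distillation).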