Distilled models are necessarily behind so long as models are progressing. Models are progressing. Maybe that will stop at some point in the future.
And Berkeley’s “The False Promise of Imitating Proprietary LLMs” found that imitation closes the style gap fast but leaves a large capability gap.
I'm OK with having last month's model at a tiny fraction of the price.
Curiously, this isn't always true.
For example, GLM 5.1 is more capable at pentesting than the model from which it is alleged to have been distilled [1].
Intuitively, this makes some sense: you can "distill" from multiple frontier models, and you can further post-train the distilled model. But I'm not sure exactly what happened with GLM 5.1.
[1]: https://dualuse.dev/posts/chinese-models-are-sometimes-bette...