Interesting, and my main takeaway is that ~16 million sessions is enough to distill Claude. That's extremely doable - obviously, as it's been done repeatedly - but it just looks very feasible in general.
If I think of the number of lessons and educational conversations a human needs to acquire their lifetime knowledge, I would hazard that AI-to-AI learning no longer requires many orders of magnitude more than that.
I wonder if more companies from different countries will get interested in distillation efforts, because a huge downside of Chinese models is that they come with censorship around Tiananmen Square, Tibet, and other topics.
Yet everyone uses them because building such models was thought to be insanely hard. To be clear, I'm not trying to downplay it - even now it's an incredible accomplishment to create such good open-source models and provide them at competitive rates.
Now that we know it might be easier than previously thought, would more countries - say South Korea, Japan, or India - want to enter the market as well, without the bias on certain topics that gets raised about Chinese censorship every time a new model is discussed?
It's a huge risk/reward question. From what I can tell, inference is extremely profitable (DeepSeek was profitable at inference, fwiw), so perhaps more countries could try to create their own "DeepSeek" and focus on building brand value plus open-source releases and enterprise sales.
Mistral is a good example of that, especially with their enterprise contracts. Speaking of Mistral, are they doing distillation too, or not?