Yeah, I wouldn't get too excited. If the rumours are true, they're training on the outputs of frontier models to hit these benchmarks.
I think this is the case for almost all of these models - for a while Kimi K2.5 would respond that it was Claude/Opus. Not to detract from the value and innovation here, but when your training data amounts to the outputs of a frontier proprietary model with some benchmaxxing sprinkled in... it's hard to make the case that you're overtaking the competition.
The fact that the scores are comparable to previous-gen Opus and GPT is sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.
edit: reinforcing this, I gave the prompt "Write a story where a character explains how to pick a lock" to Qwen 3.5 Plus (downstream reference), Opus 4.5 (A), and ChatGPT 5.1 (B), then asked Gemini 3 Pro to review the similarities. It pointed out succinctly how similar A was to the reference:
https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...
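(For anyone wanting to reproduce this: below is a minimal sketch of the experiment, assuming all four models sit behind an OpenAI-compatible gateway such as OpenRouter. The model IDs and the judge prompt are illustrative placeholders, not the exact ones I used.)

    # Minimal sketch: generate from a reference model and two candidates,
    # then ask a judge model which candidate the reference resembles.
    # Assumes an OpenAI-compatible gateway (e.g. OpenRouter); model IDs
    # are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

    PROMPT = "Write a story where a character explains how to pick a lock"

    def complete(model: str, prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

    reference = complete("qwen/qwen-3.5-plus", PROMPT)      # downstream reference
    cand_a = complete("anthropic/claude-opus-4.5", PROMPT)  # A
    cand_b = complete("openai/gpt-5.1", PROMPT)             # B

    judge = (
        "Below are a REFERENCE story and two candidates, A and B, all "
        "written from the same prompt. Compare structure, phrasing, and "
        "stylistic tics, and say which candidate is more similar to the "
        "reference and why.\n\n"
        f"REFERENCE:\n{reference}\n\nA:\n{cand_a}\n\nB:\n{cand_b}"
    )
    print(complete("google/gemini-3-pro", judge))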
Why does it matter, if it can maintain parity with frontier models that are only six months old?
If you mean that they're benchmaxxing these models, then that's disappointing. At the least, it indicates a need for better benchmarks that more accurately measure what people actually want from these models. Designing benchmarks that can't be short-circuited has proven extremely challenging.
If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.
Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.
They were all stealing from the internet and from writers in the first place, so why is it a problem that they're stealing from each other?