
magicalhippo · yesterday at 4:05 PM

From what I've gathered, they've been mostly training-limited. Better training methods and cleaner training data allow smaller models to rival or outperform larger models trained with older methods and lower-quality training data.

For example, the Qwen3 technical report[1] says that the Qwen3 models are architecturally very similar to Qwen2.5, with the main change being a tweak in the attention layers to stabilize training. And if you compare table 1 in the Qwen3 paper with table 1 in the Qwen2.5 technical report[2], the layer count, attention configuration and so on are very similar. Yet Qwen3 was widely regarded as a significant upgrade over Qwen2.5.
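
If I recall the report correctly, that tweak is QK-Norm: RMSNorm applied to the query and key projections before the attention scores are computed, which keeps the attention logits from blowing up. A rough sketch of the idea (my own toy single-head module, not Qwen's code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QKNormAttention(nn.Module):
        """Single-head attention with RMSNorm applied to the query and key
        projections before the dot product; a toy illustration of the
        stabilization tweak, not Qwen's actual implementation."""

        def __init__(self, dim: int):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim, bias=False)  # Qwen3 also drops the QKV bias
            self.k_proj = nn.Linear(dim, dim, bias=False)
            self.v_proj = nn.Linear(dim, dim, bias=False)
            self.o_proj = nn.Linear(dim, dim, bias=False)
            self.q_norm = nn.RMSNorm(dim)  # needs PyTorch >= 2.4
            self.k_norm = nn.RMSNorm(dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Normalizing q and k bounds the attention logits, which is
            # what keeps training stable at scale.
            q = self.q_norm(self.q_proj(x))
            k = self.k_norm(self.k_proj(x))
            v = self.v_proj(x)
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.o_proj(out)

    x = torch.randn(1, 16, 64)           # (batch, seq, dim)
    print(QKNormAttention(64)(x).shape)  # torch.Size([1, 16, 64])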

However, for training, they doubled the pre-training token count and tripled the number of languages. It's been shown that training on more languages can actually help LLMs generalize better. They used Qwen2.5-VL and Qwen2.5 to generate additional training data by parsing a large number of PDFs and turning them into high-quality training tokens. They also improved their annotation so they could more effectively provide diverse training tokens to the model, improving training efficiency.
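
Roughly, that data pipeline looks something like the sketch below. Every function name here is mine and purely illustrative; the report describes the pipeline but doesn't publish code.

    from dataclasses import dataclass

    @dataclass
    class Sample:
        text: str
        domain: str      # annotation used to balance the training mix
        quality: float   # score used to drop low-quality extractions

    def parse_pdf_with_vlm(pdf_path: str) -> str:
        """Placeholder for a vision-language model (Qwen2.5-VL in their case)
        that turns a PDF into clean text."""
        return "extracted text from " + pdf_path

    def annotate(text: str) -> Sample:
        """Placeholder for an LLM-based annotator that tags domain and
        scores quality; this interface is hypothetical."""
        return Sample(text=text, domain="science", quality=0.9)

    def build_corpus(pdf_paths: list[str], min_quality: float = 0.5) -> list[Sample]:
        corpus = []
        for path in pdf_paths:
            sample = annotate(parse_pdf_with_vlm(path))
            if sample.quality >= min_quality:  # keep only high-quality extractions
                corpus.append(sample)
        return corpus

    print(len(build_corpus(["paper1.pdf", "paper2.pdf"])))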

They continued this trend with Qwen3.5, where even more and better training data[3] made their Qwen3.5-397B-A17B model match the 1T-parameter Qwen3-Max-Base.

That said, there's also been a lot of work on model architecture[4], getting more speed and quality per parameter. In the case of the Qwen3-Next architecture, which Qwen3.5 is based on, that means things like hybrid attention for faster long-context operation, and sparse MoE and multi-token prediction for less compute per output token.
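
The sparse-MoE part of that is easy to illustrate: a router picks the top-k experts for each token, so only a small slice of the total parameters is active per output token. A toy sketch (the sizes and routing layout are made up, not Qwen3-Next's):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        """Top-k routed mixture-of-experts FFN: each token goes to only k of
        the n experts, so per-token compute scales with k rather than n.
        A toy sketch, not the Qwen3-Next implementation."""

        def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(dim, n_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(n_experts)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            scores = F.softmax(self.router(x), dim=-1)         # (tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)         # choose k experts per token
            weights = weights / weights.sum(-1, keepdim=True)  # renormalize chosen weights
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e                   # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    tokens = torch.randn(10, 64)        # 10 tokens, model dim 64
    print(SparseMoE(64)(tokens).shape)  # torch.Size([10, 64])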

I used Qwen as an example here; from what I gather, they're just one instance of the general trend.

[1]: https://arxiv.org/abs/2505.09388

[2]: https://arxiv.org/abs/2412.15115

[3]: https://qwen.ai/blog?id=qwen3.5

[4]: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d...