Hacker News

girvo | yesterday at 10:51 PM | 1 reply

The big question for me, having used a lot of these SOTA Chinese models, is: what is its token efficiency like?

Running Step 3.5 Flash locally, for example: it's an amazingly capable model all things considered, but its token efficiency is so bad that it gets outperformed by most others on wall-clock time (even with my MTP support for it hacked into llama.cpp: despite being trained on three heads, MTP 2 is the sweet spot, and it only gets it from 20 tk/s to 30 tk/s on my Spark).
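To make the wall-clock point concrete, here is a rough sketch of the arithmetic. Only the 20 and 30 tk/s figures come from the comment above; the token counts, the 15 tk/s rate, and the helper name `wall_clock_seconds` are hypothetical, picked purely for illustration, and prompt processing time is ignored.

```python
def wall_clock_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Generation time for one answer, ignoring prompt processing."""
    return output_tokens / tokens_per_second


# Hypothetical token counts for the same task (not measured values).
verbose_tokens = 12_000  # a token-hungry model
concise_tokens = 4_000   # a more token-efficient model

# 20 -> 30 tk/s is the MTP speedup quoted in the comment.
verbose_no_mtp = wall_clock_seconds(verbose_tokens, 20.0)
verbose_mtp = wall_clock_seconds(verbose_tokens, 30.0)

# Hypothetical: the efficient model decodes at only 15 tk/s.
concise_slow = wall_clock_seconds(concise_tokens, 15.0)

print(f"verbose, no MTP: {verbose_no_mtp:.0f} s")
print(f"verbose, MTP 2:  {verbose_mtp:.0f} s")
print(f"concise, slower: {concise_slow:.0f} s")
```

Under these made-up numbers, a model that decodes at half the speed still finishes first because it emits a third of the tokens, which is why a 1.5x decode speedup alone may not close the gap.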

The DeepSeek models and Qwen 3.5 Plus are also good examples of this: compared to Opus, and especially GPT 5.5, they use many more tokens to get to the same answers.

I'm really hoping that Qwen 3.7 is better in this regard; can't wait to try it out.

(ps. running DeepSeek v4 Flash on my Spark is absolutely wild, thanks antirez if you see this haha)


Replies

nl | today at 12:42 AM

Yes, it's a big thing that people are slowly becoming more aware of.

Nvidia models are even worse than Qwen! https://sql-benchmark.nicklothian.com/#token-efficiency-and-... (mouse over the cells for token counts and click for traces)

Gemma 4 is good for this, as AA notes:

> Gemma 4 31B is notably token efficient, using 39M output tokens to run the Intelligence Index vs 98M for Qwen3.5 27B (Reasoning). This is ~2.5x fewer output tokens for a model scoring 3 points lower. For context, the other models at the 42-point intelligence level also use significantly more tokens: MiniMax-M2.5 (56M), DeepSeek V3.2 (Reasoning, 61M), and GLM-4.7 (Reasoning, 167M)

https://artificialanalysis.ai/articles/gemma-4-everything-yo...
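As a quick sanity check on the quoted figures, this small sketch computes each model's output-token count relative to Gemma 4 31B. The token counts (in millions) are taken from the AA quote above; the variable names are just for illustration.

```python
# Output tokens (millions) to run the Intelligence Index, per the AA quote.
output_tokens_millions = {
    "Gemma 4 31B": 39,
    "Qwen3.5 27B (Reasoning)": 98,
    "MiniMax-M2.5": 56,
    "DeepSeek V3.2 (Reasoning)": 61,
    "GLM-4.7 (Reasoning)": 167,
}

baseline = output_tokens_millions["Gemma 4 31B"]

# Ratio of each model's output tokens to Gemma 4 31B's.
ratios = {
    name: tokens / baseline
    for name, tokens in output_tokens_millions.items()
}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:28s} {ratio:.2f}x")
```

The Qwen3.5 27B ratio comes out at roughly 2.5x, matching the figure in the quote, and GLM-4.7 lands above 4x.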