The non-hallucination rate in AA-omniscience is SOTA, better than Opus 4.7, Gemini 3.1 Pro and GPT 5.5! Congrats to the team.
referencing this:
https://artificialanalysis.ai/evaluations/omniscience?models...
(had to add it to the chart, wasn't displayed by default. is it the lowest rate in the dataset or no?)
> The non-hallucination rate in AA-omniscience is SOTA
Note that a perfect "non-hallucination rate" is rather meaningless as such tests can contain human hallucinations.
It means the model aligns with the possibly-true, possibly-false beliefs of the group that made the test.
Truly incredible! Very impressed by their progress. I wonder how many of their own chips they used for training.
wonder at which level there's a capability state transition? 5%? 1%?
The big question for me, having used a lot of these SOTA Chinese models, is: what is its token efficiency like?
Take running Step 3.5 Flash locally, for example: it's an amazingly capable model all things considered, but its token efficiency is so bad that it gets outperformed by most others on wall-clock time (even with my MTP support for it hacked into llama.cpp: despite being trained on three heads, MTP 2 is the sweet spot, and it only gets it from 20 tk/s to 30 tk/s on my Spark).
The DeepSeek models and Qwen 3.5 Plus are also good examples of this: compared to Opus, and especially GPT 5.5, they use many more tokens to get to the same answers.
I'm really hoping that Qwen 3.7 is better in this regard, can't wait to try it out
(ps. running DeepSeek v4 Flash on my Spark is absolutely wild, thanks antirez if you see this haha)
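To put rough numbers on the token-efficiency point above: wall-clock time is just output tokens divided by decode speed, so a model that needs several times more tokens can lose even after an MTP speedup. A minimal sketch follows. The 20 and 30 tk/s figures are the ones quoted above; the token counts and function names are made up purely for illustration, not measurements of any model.

```python
def wall_clock_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response: tokens emitted divided by decode speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def break_even_tps(verbose_tokens: int, concise_tokens: int, concise_tps: float) -> float:
    """Decode speed a verbose model needs to match a concise model's wall-clock time."""
    return concise_tps * verbose_tokens / concise_tokens


# Decode speeds quoted in the thread (without and with MTP 2).
BASE_TPS = 20.0
MTP_TPS = 30.0

# ASSUMPTION: hypothetical token counts for the same task, not measured.
VERBOSE_TOKENS = 12_000  # token-hungry model
CONCISE_TOKENS = 4_000   # token-efficient model

if __name__ == "__main__":
    print("verbose, no MTP:", wall_clock_seconds(VERBOSE_TOKENS, BASE_TPS), "s")
    print("verbose, MTP 2: ", wall_clock_seconds(VERBOSE_TOKENS, MTP_TPS), "s")
    print("concise, no MTP:", wall_clock_seconds(CONCISE_TOKENS, BASE_TPS), "s")
    print("break-even tk/s:", break_even_tps(VERBOSE_TOKENS, CONCISE_TOKENS, BASE_TPS))
```

With these assumed counts, a 3x token overhead means the verbose model would need 3x the decode speed to break even, so a 1.5x MTP gain doesn't close the gap.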