This counts only incorrect answers though. A model can get a 0% hallucination rate just by refusing to answer every question.
I think that's what the Omniscience Index is for:
https://artificialanalysis.ai/evaluations/omniscience#aa-omn...
It rewards correct answers, penalizes hallucinations, and gives no reward for refusing to answer.
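To make the difference concrete, here's a rough sketch of the two metrics as I understand them. The +1 / -1 / 0 weights, the scaling to [-100, 100], and the "incorrect over all questions" definition of hallucination rate are my assumptions for illustration, not the published formulas, and the function names are made up:

```python
from collections import Counter

def hallucination_rate(outcomes):
    # Assumed definition: share of all questions answered incorrectly.
    counts = Counter(outcomes)
    return counts["incorrect"] / len(outcomes)

def index_score(outcomes):
    # Assumed weights: +1 per correct answer, -1 per incorrect answer,
    # 0 per abstention, scaled to the range [-100, 100].
    counts = Counter(outcomes)
    return 100 * (counts["correct"] - counts["incorrect"]) / len(outcomes)

# Three hypothetical models answering the same 10 questions.
always_refuse = ["abstain"] * 10
knowledgeable = ["correct"] * 7 + ["incorrect"] * 2 + ["abstain"]
guesser = ["correct"] * 5 + ["incorrect"] * 5

for name, outcomes in [("always_refuse", always_refuse),
                       ("knowledgeable", knowledgeable),
                       ("guesser", guesser)]:
    print(name, hallucination_rate(outcomes), index_score(outcomes))
```

Under these assumptions the always-refuse model gets a perfect 0% hallucination rate but only scores 0 on the index, the same as a model that guesses right half the time, while the model that actually knows things comes out ahead despite hallucinating on 2 of 10 questions.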
It's interesting just how poorly some popular Chinese models fare in this regard, like GLM 5.1 or DeepSeek 4 Pro.
Gemini 3.x has truly remarkable knowledge given how it leads in this benchmark despite being (quite a bit) more prone to hallucinate than Claude Opus.
> by refusing to answer all questions.
Cool, precisely the thing other AIs are too stupid to do when they don't have the necessary knowledge.
Yes. A model that can answer "I don't know" would be much more trustworthy than the used-car salesman we have now.
Isn't that precisely the reason why we introduced the term hallucination? Because LLMs have historically always made up bullshit if they cannot answer directly... If they have now nailed this, so that the model declines to respond instead of responding incorrectly, then a lot of previously unusable use cases would become feasible.
So I feel like that's exactly the right metric and the way to track it wrt hallucinations.