It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, ...

wongarsu • today at 11:46 AM • 3 replies • view on HN

It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that lets LLMs decline to answer when they are unsure and punishes them for trying to bullshit their way through.

Replies

corlinp • today at 5:09 PM

That one is a bit sus to me, because the models that do the worst on Omniscience Accuracy do the best on non-hallucination. The top model for this benchmark is "MiniCPM5-1B (Non-reasoning)" which gets a whopping 99% vs 45% for Fable 5.

I'd love to see a good hallucination benchmark, but this isn't one. There's no possibility that a 1B model hallucinates less than Fable 5.
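The pattern described above (lowest accuracy, highest non-hallucination) can be sketched with a little arithmetic. This is only an illustration: the formula below assumes the non-hallucination rate is the share of not-correct responses that were abstentions rather than wrong answers, which may not match Artificial Analysis's exact definition, and the counts for both models are invented for the example, not benchmark results.

```python
# Illustrative sketch only. The metric definition is an ASSUMPTION:
# non-hallucination rate = abstentions / (wrong answers + abstentions).
# All counts below are hypothetical, not real benchmark data.

def accuracy(correct: int, incorrect: int, abstained: int) -> float:
    """Fraction of all questions answered correctly."""
    total = correct + incorrect + abstained
    return correct / total if total else 0.0


def non_hallucination_rate(correct: int, incorrect: int, abstained: int) -> float:
    """Of the questions the model did not get right, the fraction it declined
    to answer instead of answering wrongly (assumed definition)."""
    not_correct = incorrect + abstained
    if not_correct == 0:
        return 1.0  # nothing was missed, so nothing was hallucinated
    return abstained / not_correct


# Hypothetical small model: knows little, refuses almost everything.
small = {"correct": 5, "incorrect": 1, "abstained": 94}
# Hypothetical large model: knows a lot, but guesses when unsure.
large = {"correct": 60, "incorrect": 22, "abstained": 18}

for name, counts in (("small", small), ("large", large)):
    print(
        f"{name}: accuracy={accuracy(**counts):.2f}, "
        f"non-hallucination={non_hallucination_rate(**counts):.2f}"
    )
```

Under that assumed definition, a model that abstains on nearly everything scores close to 100% on non-hallucination while answering almost nothing correctly, so the metric is only informative when read alongside accuracy.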


SilverServer • today at 2:00 PM

It took me a while to figure out how to interpret the benchmark correctly, because on the overview page it says "AA-Omniscience Non-Hallucination Rate," but on the benchmark page https://artificialanalysis.ai/evaluations/omniscience#aa-omn... it said "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.

andai • today at 1:35 PM

This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?

