Hacker News

wongarsu today at 11:46 AM (3 replies)

It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that lets LLMs decline to answer when they are unsure and punishes them for trying to bullshit their way through.


Replies

corlinp today at 5:09 PM

That one is a bit sus to me, because the models that do the worst on Omniscience Accuracy do the best on non-hallucination. The top model for this benchmark is "MiniCPM5-1B (Non-reasoning)" which gets a whopping 99% vs 45% for Fable 5.

I'd love to see a good hallucination benchmark, but this isn't one. There's no possibility that a 1B model hallucinates less than Fable 5.
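
To make the concern concrete, here is a minimal sketch. It assumes the non-hallucination rate is the share of not-correct responses where the model abstained instead of answering wrong; that definition is an assumption for illustration, not something taken from the benchmark page. The `Tally` type, the function names, and both tallies are hypothetical and invented, not real benchmark results.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical per-model counts over a question set."""
    correct: int
    incorrect: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained


def accuracy(t: Tally) -> float:
    # Fraction of all questions answered correctly.
    return t.correct / t.total


def non_hallucination_rate(t: Tally) -> float:
    # ASSUMED definition: among questions the model did not get right,
    # the fraction where it abstained rather than giving a wrong answer.
    missed = t.incorrect + t.abstained
    return t.abstained / missed if missed else 1.0


# Invented tallies, for illustration only.
tiny_model = Tally(correct=2, incorrect=1, abstained=97)    # almost always abstains
large_model = Tally(correct=50, incorrect=30, abstained=20)  # answers most questions

for name, t in [("tiny", tiny_model), ("large", large_model)]:
    print(f"{name}: accuracy={accuracy(t):.2f}, "
          f"non_hallucination={non_hallucination_rate(t):.2f}")
```

Under that assumed definition, a model that abstains on nearly everything tops the non-hallucination chart while knowing almost nothing, so the number only means much when read next to accuracy.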

SilverServer today at 2:00 PM

It took me a while to figure out how to interpret the benchmark correctly, because the overview page says "AA-Omniscience Non-Hallucination Rate," but the benchmark page (https://artificialanalysis.ai/evaluations/omniscience#aa-omn...) said "the lower, the better." Eventually, I realized that the "non" reverses the scores. Read that way, the results are consistent.

andai today at 1:35 PM

This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?
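
A rough way to see why, as a sketch under stated assumptions: suppose a correct answer scores 1, abstaining scores 0, and a wrong answer costs some penalty. The function names and penalty values below are hypothetical and do not describe any particular benchmark's scoring.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering when the model is right with probability p_correct.

    Assumed scoring: +1 for a correct answer, -wrong_penalty for a wrong one.
    Abstaining is assumed to score 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    # Answering is worthwhile only if it beats the abstain score of 0.
    return expected_score(p_correct, wrong_penalty) > 0.0


# Accuracy-only grading (no penalty): even a 20% guess beats abstaining.
print(should_answer(0.2, wrong_penalty=0.0))

# Wrong answers cost as much as right answers earn: a 20% guess no longer pays.
print(should_answer(0.2, wrong_penalty=1.0))
```

With no penalty, guessing never scores worse than abstaining, so a benchmark graded on accuracy alone rewards answering everything. With a penalty of 1, answering only pays once the model is more likely right than wrong.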
