Hacker News

corlinp today at 5:09 PM

That one is a bit sus to me, because the models that do the worst on Omniscience Accuracy do the best on non-hallucination. The top model for this benchmark is "MiniCPM5-1B (Non-reasoning)" which gets a whopping 99% vs 45% for Fable 5.

I'd love to see a good hallucination benchmark, but this isn't one. There's no possibility that a 1B model hallucinates less than Fable 5.


Replies

nvme0n1p1 today at 8:15 PM

> There's no possibility that a 1B model hallucinates less than Fable 5.

Sure there is. The simplest possible model that says "I don't know" 100% of the time would hallucinate less than Fable 5. Scaling up to more useful models, it's just a matter of tuning false negatives vs false positives.
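
To make that concrete, here is a rough sketch in Python. The scoring rule is an assumption made for illustration, not the benchmark's actual definition: accuracy is the share of questions answered correctly, and hallucination rate is the share answered incorrectly, with an abstention ("I don't know", represented as `None`) counting as neither. The `score` function, the `Scores` type, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Scores:
    accuracy: float            # fraction of questions answered correctly
    hallucination_rate: float  # fraction of questions answered incorrectly


def score(responses: Sequence[Optional[str]], truths: Sequence[str]) -> Scores:
    """Score a run under the assumed metric; None means the model abstained."""
    total = len(truths)
    correct = sum(r is not None and r == t for r, t in zip(responses, truths))
    wrong = sum(r is not None and r != t for r, t in zip(responses, truths))
    return Scores(accuracy=correct / total, hallucination_rate=wrong / total)


# Toy ground truth (made up for this example).
truths = ["a", "b", "c", "d"]

# A "model" that always says "I don't know".
always_abstain = [None, None, None, None]

# A model that always answers and gets half of them right.
guesser = ["a", "b", "x", "y"]

print(score(always_abstain, truths))  # never wrong, but never right either
print(score(guesser, truths))         # more accurate, but hallucinates more
```

Under this assumed metric, the always-abstain model scores perfectly on non-hallucination while being useless on accuracy, which is the same trade-off between false negatives and false positives.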