Is there a good benchmark tracking hallucinations? The models are all incredibly good now, even the ...

aliljet • yesterday at 6:20 PM • 8 replies

Is there a good benchmark tracking hallucinations? The models are all incredibly good now, even the open ones, and my hope is that the rate of hallucinations is something that's falling off in concert with larger and larger context lengths.

Replies

vlmutolo • today at 3:17 AM

> While OpenAI originally pioneered Codex (which went on to power GitHub Copilot), Google’s direct answer for dedicated, native code completion and natural-language-to-code generation is CodeGemma.

https://g.co/gemini/share/33e7a589a161

WarmWash • yesterday at 6:38 PM

People complain about them incessantly, but I can almost never get people to actually post receipts. Every provider allows sharing chats, and anyone can share a prompt that reliably produces hallucinations.

More often than not, people are using images in the chats that go awry. Which is fair, since the models are sold as multi-modal, but image analysis is still at gpt-4.0 text-analysis levels.

There are also knowledge cutoff issues, where people forget that the models' knowledge stops months to a year or more in the past.

throawayonthe • yesterday at 6:32 PM

well there is https://artificialanalysis.ai/evaluations/omniscience

Sevii • yesterday at 6:25 PM

I haven't been bothered by hallucinations in premier models since early last year. Still see it in smaller local models though.

krupan • yesterday at 8:22 PM

It really depends what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.

majso • yesterday at 6:36 PM

maybe something like this? https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

FergusArgyll • yesterday at 6:51 PM

As long as the models use web search, they almost never hallucinate anymore. The fast models (haiku, gpt-instant, flash) still sometimes have the problem where they don't search before answering, so they can hallucinate.

yieldcrv • yesterday at 6:33 PM

if last year's models were the ones people got familiar with in late 2022, hallucinations would be an underrepresented rumor; there would be no articles about it because it's so rare. overconfident lawyers wouldn't have messed up dockets in court with fake case law. in other domains that move faster, sources would be only partially outdated, with agentic search and mcp servers filling in the gaps

AI psychosis would be the problem people talk about more: not just outright agreement but subtle ways of making you feel confident in your ideas. "yes, buy that domain name, and buy these other ones for defensibility"

(the domain name is dumb and completely unmarketable)
