Is there a good benchmark tracking hallucinations? The models are all incredibly good now, even the open ones, and my hope is that the rate of hallucinations is something that's falling off in concert with larger and larger context lengths.
People complain about them incessantly, but I can almost never get people to actually post receipts. Every provider allows sharing chats, and anyone can share a prompt that reliably produces hallucinations.
More often than not, the responses that go awry involve images. Which is fair, since the models are sold as multimodal, but image analysis is still at GPT-4.0 text-analysis levels.
There are also knowledge-cutoff issues, where people forget that the models' knowledge is months to a year or more out of date.
well there is https://artificialanalysis.ai/evaluations/omniscience
I haven't been bothered by hallucinations in premier models since early last year. Still see it in smaller local models though.
It really depends on what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.
maybe something like this? https://petergpt.github.io/bullshit-benchmark/viewer/index.v...
As long as the models use web search, they almost never hallucinate anymore. The fast models (Haiku, gpt-instant, Flash) still sometimes have the problem where they don't search before answering, so they can hallucinate.
If last year's models were the ones people first got familiar with in late 2022, hallucinations would be a rarely mentioned rumor. There would be no articles about them because they'd be so rare. Overconfident lawyers wouldn't have messed up court dockets with fake case law. In other domains that move faster, sources would be only partially outdated, with agentic search and MCP servers filling in the gaps.
AI psychosis would be the problem people talk about more: not just outright agreement, but subtle ways of making you feel confident in your ideas. "Yes, buy that domain name, and buy these other ones for defensibility."
(the domain name is dumb and completely unmarketable)
> While OpenAI originally pioneered Codex (which went on to power GitHub Copilot), Google’s direct answer for dedicated, native code completion and natural-language-to-code generation is CodeGemma.
https://g.co/gemini/share/33e7a589a161