aesthesia • yesterday at 12:34 AM • 7 replies

Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.

I'd also hesitate to attribute this difference in hallucination rates purely to model size. Yes, GLM-5.2 hallucinates much less frequently than DeepSeek-V4 Pro, which has twice as many parameters, but DeepSeek-V4 Flash is less than half the size of GLM-5.2 and tops the AA-Omniscience hallucination index. Opus 4.8, which is likely larger than DeepSeek-V4 Pro, has a 36% hallucination rate on the index, above GLM-5.2's 28% but way below the DeepSeek numbers. Opus also has a 47% accuracy rate vs GLM-5.2's 25%. If you use these numbers to calculate the absolute hallucination rate (i.e., the number of hallucinated responses divided by the total number of responses), you get 19% for Opus and 21% for GLM-5.2.
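For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes the index's hallucination rate is the share of not-correct responses that are wrong answers rather than abstentions, so the absolute rate is (1 - accuracy) times the conditional rate. The function name is made up for illustration; the inputs are just the figures quoted above.

```python
def absolute_hallucination_rate(accuracy: float, conditional_rate: float) -> float:
    """Share of ALL responses that are hallucinated.

    Assumption: conditional_rate is the fraction of not-correct responses
    that are wrong answers (as opposed to abstentions such as "I don't know").
    """
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= conditional_rate <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    # (1 - accuracy) is the share of questions the model did not answer
    # correctly; conditional_rate is the share of those it answered wrongly.
    return (1.0 - accuracy) * conditional_rate


# (accuracy, conditional hallucination rate) as quoted in the comment
models = {
    "Opus 4.8": (0.47, 0.36),
    "GLM-5.2": (0.25, 0.28),
}

for name, (accuracy, conditional_rate) in models.items():
    rate = absolute_hallucination_rate(accuracy, conditional_rate)
    print(f"{name}: {rate:.0%}")
# Opus 4.8: 19%
# GLM-5.2: 21%
```

The point is that a model with a higher conditional rate can still hallucinate less often overall if it answers more questions correctly to begin with.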

So yes, all else equal larger models may be more prone to hallucination in scenarios where they don't know the answer, but there are a lot of other factors that affect hallucination rates, and it's not totally clear that this is the main metric that's worth tracking.

Replies

ComputerGuru • yesterday at 3:21 PM

I’m not disagreeing with you, but at the same time, models don’t “know” anything in that binary sense. I’m not trying to get into the weeds here; I genuinely mean that what you pass off as a simple explanation is actually incredibly nuanced. A fact appeared once in the training data, a fact never appeared in the training data, a fact appeared ten times, a fact appeared a thousand times. Which does the model know? Facts aren’t stored as-is; they’re all broken down into their components and compressed in the weights. “Similar” facts that didn’t appear an overwhelming number of times get bundled together and eventually conflated. But then what is a similar fact? Which facts were entirely ablated vs which were bundled together with others, effectively poisoning the pool but also giving it inference strength? The model doesn’t know anything and can never know what it knows or doesn’t know.

in-silico • yesterday at 2:21 AM

Additionally, maybe it's easier for a model to realize that it doesn't know the answer when the question is easier.

If Opus gets all but the hardest questions right, it might have a higher hallucination rate because the questions it gets wrong are the ones where verification or hallucination detection is the most difficult.

andix • yesterday at 1:48 PM

I guess you can test that on hypotheticals. Ask about things after the knowledge cutoff that never happened. Or ask things that are genuinely unsolvable.

sudosysgen • yesterday at 5:31 AM

This is missing a common failure mode, which is information past the knowledge cutoff. If you need info past that time they'll fail no matter how big or small the model is, so the hallucination rate can matter independently of the knowledge base. If all use-cases had a uniform risk of falling out of support, this would be a valid argument, but since it's often the case that a datapoint is guaranteed to fall out of support, the absolute ability to recognize that is crucial.

reinitctxoffset • yesterday at 5:15 AM

Hallucination should be called "failure to ground".

Something about the cost model of US near frontier has the cattle prod out whenever a model is uncertain but thrashes on whether to search. Search flinch is roughly all hallucination.

I don't even wait for the model's turn, if there's a man page or Hoogle hit, stuff the last prefix cache cut point. You come out ahead.

gymbeaux • yesterday at 5:49 AM

Those numbers are abysmal. Should we really be using LLMs to write our code? I have a theory: LLMs can spit out code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time. An enterprise app developed entirely by LLM-happy devs might end up virtually unmaintainable.

I’m not sure how to explain it, but the more I see LLM-written code the more I feel it’s bad code doing a good job of masquerading as good code. I think this take will become less hot in the next year or two when we see enterprise greenfield projects that were created entirely with LLM “assistance” go to prod. I think we’ll find that the code is difficult for humans to read, understand, debug, and extend, and I think the larger the codebase the harder it will be for LLMs to maintain. More opportunity for hallucination, larger context windows needed, more tokens bought and spent for smaller and smaller code changes. I think the more code an LLM writes for an app, the worse that codebase becomes.

grayhatter • yesterday at 2:56 AM

> Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.

Do you have a cite for this?

If a human makes up some bullshit lie, I wouldn't accuse them of making it up only if they actually knew the correct answer. If you don't know, the only correct answer is "I don't know". Any other answer is made-up bullshit. Why is it a hallucination if and only if the LLM contains the answer? If you make something up it's still wrong. It shouldn't matter if you could give the correct answer. You didn't, and invented some bullshit instead.

Follow up question, how can I apply this rule set to the next test I have to take? I'd love to be able to use "I didn't know" as the excuse for why I made something up.

edit:

> and it's not totally clear that this is the main metric that's worth tracking.

I don't know, the rate at which some model is willing to make up something feels useful. If the argument I see repeated on HN so much is that it's impossible to completely get rid of hallucinations; being able to choose a model that's less likely to invent some lie seems like a positive trait, no?

Either way, I'm happy to agree that a restrictive definition, where a lie doesn't count as a hallucination iff the model doesn't know the answer, feels strictly, infinitely less useful than an exact error rate. What percentage of emitted tokens are misleading would be useful for me. Anyone know any group that's attempted to quantify the global error rate?
