Hacker News

whynotminot · yesterday at 1:55 PM · 1 reply

Don’t get me wrong, Gemini 3 is very impressive! It just seems to always need to give you an answer, even if it has to make one up.

This was also largely how ChatGPT behaved before 5, but OpenAI has gotten much, much better at having the model admit it doesn’t know, or tell you that the thing you’re looking for doesn’t exist, instead of hallucinating something plausible-sounding.

Recent example: I was trying to fetch some specific data using an API, and after reading the API docs, I couldn’t figure out how to get it. I asked Gemini 3, since my company pays for that. Gemini gave me a plausible-sounding API call to make… which did not work and was completely made up.


Replies

cubefox · yesterday at 9:04 PM

Okay, I haven't really tested hallucinations like this, so that may well be true. There is another weakness of GPT-5 (including 5.1 and 5.2) I discovered: I have a neat philosophical paradox about information value. It is not in the pre-training data, because I came up with the paradox myself and haven't posted it online. So asking a model to solve the paradox is a nice little intelligence test of informal/philosophical reasoning ability.

If I ask ChatGPT to solve it, the non-thinking GPT-5 model usually starts out confidently with a completely wrong answer and then smoothly transitions into the correct one, though without flagging that the first half of its answer was wrong. Overall not too bad.

But if I choose the reasoning GPT-5 model, it hardly thinks at all (6 seconds when I just tried) and then gives a completely wrong answer, e.g. about why a premiss technically doesn't hold under contrived conditions, ignoring the fact that the paradox persists even with those conditions excluded. Basically, it both over- and underthinks the problem. When you tell it that it can ignore those edge cases because they don't affect the paradox, it overthinks even more and comes up with other wrong solutions that get increasingly technical and confused.

So in this case the GPT-5 reasoning model is actually worse than the version without reasoning. Which is kind of impressive. Gemini 3 Pro generally just gives the correct answer here (it always uses reasoning).

Though I admit this is just a single example and hardly significant. I guess it reveals that the reasoning training leans hard on more verifiable things like math and coding but is very brittle at philosophical thinking that isn't just repeating knowledge gained during pre-training.

Maybe another interesting data point: if you ask either ChatGPT or Gemini why there are so many dark mode websites (black background with white text) but basically no dark mode books, both models come up with contrived explanations involving printing costs, which would be largely irrelevant for modern printers. There is a far better explanation than that, but both LLMs a) can't think of it (which isn't too bad, since the explanation isn't trivial) and b) are unable to say "Sorry, I don't really know", which is much worse.

Basically, if you ask either LLM for an explanation of something, they seem to always answer with complete confidence, even when the explanation is terrible. That seems related to the hallucination you mentioned: in both cases the model can't express its uncertainty.