Hacker News

dom96 · yesterday at 7:08 PM · 2 replies

Why do none of the benchmarks test for hallucinations?


Replies

tedsanders · yesterday at 8:34 PM

In the text, we shared a hallucination benchmark: on a set of error-prone ChatGPT prompts we collected, claim-level errors fell by 33% and responses containing an error fell by 18% (though of course the rate will vary a lot across different types of prompts). Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down. I wasn't sure how best to plot this stat, so we kept it as text only, which kind of buries it, I admit.

(I work at OpenAI.)

netule · yesterday at 7:51 PM

Optics. It would be inconvenient for marketing, so they leave those stats to third parties to figure out.