Hacker News

dom96 · yesterday at 7:08 PM · 2 replies

Why do none of the benchmarks test for hallucinations?


Replies

tedsanders · yesterday at 8:34 PM

In the text, we shared a hallucination benchmark: on a set of error-prone ChatGPT prompts we collected, claim-level errors fell by 33% and responses containing an error fell by 18% (though of course the rate will vary a lot across different types of prompts). Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down. I wasn't sure how best to plot this stat, so we kept it as text only, which kind of buries it, I admit.

(I work at OpenAI.)

netule · yesterday at 7:51 PM

Optics. It would be inconvenient for marketing, so they leave those stats to third parties to figure out.