Hacker News

throw310822, today at 12:58 PM

Not sure I'm understanding this. The models are asked to evaluate the truth of random claims out of their own head (except for Gemini with search grounding)? Isn't it exactly the same as asking people to play any quiz game and then rating them as "they disagree n% of the time"?

The output buckets are also pretty questionable: the difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?


Replies

kostaj, today at 2:57 PM

Agree that True and Mostly True might be very close, and the gap could just be a calibration difference. The same goes for Misleading and False. A better headline number might be the 34% of claims with substantial or polar-opposite verdicts.
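To make the distinction concrete, here is a rough sketch of how one could separate calibration-level differences from substantial ones. The bucket ordering, the grouping into a "true side" and a "false side", and the names `SCALE`, `classify_spread`, and `headline_share` are all assumptions for illustration, not how the study actually scored things.

```python
# Ordinal positions for the four verdict buckets named in this thread.
# ASSUMPTION: this ordering and the two-sided grouping are illustrative only.
SCALE = {"True": 0, "Mostly true": 1, "Misleading": 2, "False": 3}


def classify_spread(verdicts):
    """Label one claim by how far apart the models' verdicts are."""
    ranks = [SCALE[v] for v in verdicts]
    lo, hi = min(ranks), max(ranks)
    if lo == hi:
        return "agree"
    # Ranks 0-1 form the "true side", ranks 2-3 the "false side".
    if lo // 2 == hi // 2:
        return "calibration"      # e.g. True vs Mostly true
    if hi - lo == 3:
        return "polar-opposite"   # True vs False
    return "substantial"          # verdicts land on different sides


def headline_share(claims):
    """Fraction of claims with substantial or polar-opposite verdicts."""
    if not claims:
        return 0.0
    hits = sum(
        classify_spread(v) in ("substantial", "polar-opposite") for v in claims
    )
    return hits / len(claims)
```

Under this grouping, True vs Mostly true never counts toward the headline number, while Mostly true vs Misleading does, since those two land on opposite sides.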