Hacker News

simonw today at 2:15 PM (2 replies)

This study treats models disagreeing - returning both true and mostly true - as a failure.


Replies

pjdesno today at 2:47 PM

They overstate their results in the headline.

In section 2, 34% of cases are found to have "substantive" disagreements differing by 2 or more buckets - True + Misleading, Mostly True + False, or True + False.

This is probably a better measure than the headline one. It's still a concerning fraction, although some of it is no doubt due to forcing the models to return an answer in "I don't know" cases.

kostaj today at 2:50 PM

Agree with @pjdesno that the 34% substantive or polar disagreement might be a better headline number. Or even the 21% polar disagreement (at least one model True, and at least one model False), which is still high for many real-world applications.
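
For anyone who wants to see how the tiers discussed above relate, here is a rough sketch in Python. The bucket order (True, Mostly True, Misleading, False) is my assumption, inferred from the pairs listed as "2 or more buckets" apart; the study may define it differently. The `classify` function and the "minor" and "agree" labels are hypothetical names for illustration, not taken from the paper.

```python
# Assumed ordering of verdict buckets, inferred from the thread; not confirmed by the study.
BUCKETS = ["True", "Mostly True", "Misleading", "False"]


def classify(verdicts):
    """Label the disagreement among one claim's model verdicts.

    'polar'       : at least one model says True and at least one says False
    'substantive' : verdicts span 2 or more buckets (but are not polar)
    'minor'       : verdicts span adjacent buckets only
    'agree'       : every model returned the same bucket
    """
    ranks = [BUCKETS.index(v) for v in verdicts]
    spread = max(ranks) - min(ranks)
    if "True" in verdicts and "False" in verdicts:
        return "polar"
    if spread >= 2:
        return "substantive"
    if spread == 1:
        return "minor"
    return "agree"


if __name__ == "__main__":
    print(classify(["True", "Mostly True"]))       # adjacent buckets
    print(classify(["Mostly True", "False"]))      # two buckets apart
    print(classify(["True", "Misleading", "False"]))  # True and False both present
```

Under this reading, the headline number counts everything except "agree", the 34% figure counts "substantive" plus "polar", and the 21% figure counts "polar" alone.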