This study treats models disagreeing (e.g. returning both True and Mostly True) as a failure.

simonw • today at 2:15 PM • 2 replies

Replies

They overstate their results in the headline.

In section 2, 34% of cases are found to have "substantive" disagreements, meaning verdicts that differ by 2 or more buckets: True + Misleading, Mostly True + False, or True + False.

This is probably a better measure than the headline one. It's still a concerning fraction, although part of it is no doubt due to forcing "I don't know" cases to return an answer anyway.

kostaj • today at 2:50 PM

Agree with @pjdesno that the 34% substantive or polar disagreement might be a better headline number. Or even the 21% polar disagreement (at least one model True and at least one model False), which is still high for many real-world applications.
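
To make the two thresholds concrete, here is a rough Python sketch of how they could be computed per claim. The bucket order (True, Mostly True, Misleading, False) is an assumption inferred from the examples quoted above, and the function names are hypothetical; this is not the study's actual code.

```python
# Assumed bucket order, inferred from the "2 or more buckets" examples
# in this thread. The study itself may define the scale differently.
BUCKETS = ["True", "Mostly True", "Misleading", "False"]
RANK = {name: i for i, name in enumerate(BUCKETS)}


def spread(verdicts):
    """Distance in buckets between the two most distant verdicts."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def is_substantive(verdicts):
    """Verdicts differ by 2 or more buckets."""
    return spread(verdicts) >= 2


def is_polar(verdicts):
    """At least one model says True and at least one says False."""
    return "True" in verdicts and "False" in verdicts


# Hypothetical verdicts from three models on one claim.
example = ["True", "Mostly True", "False"]
print(spread(example), is_substantive(example), is_polar(example))
```

Under these definitions every polar case is also a substantive one, which fits the 21% figure being a subset of the 34%.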
