
Cynddl · today at 7:13 PM

Once again, an evaluation missing confidence intervals. Claims of “continued improvement” and “significant improvement” without any significance testing are moot.

With many colleagues (including from AISI themselves!), we recently reviewed 445 AI benchmarks & evaluations from the past few years. Our work was published at NeurIPS (https://openreview.net/pdf?id=mdA5lVvNcU) and we made eight recommendations for better evaluations. One is “use statistical methods to compare models” (a minimal sketch of what this looks like follows the checklist):

□ Report the benchmark’s sample size and justify its statistical power

□ Report uncertainty estimates for all primary scores to enable robust model comparisons

□ If using human raters, describe their demographics and mitigate potential demographic biases in rater recruitment and instructions

□ Use metrics that capture the inherent variability of any subjective labels, without relying on single-point aggregation or exact matching
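To make the first two points concrete, here is a minimal sketch of what uncertainty estimates and a sample-size check can look like for a pass/fail benchmark (Python/NumPy; the function names and per-item scores are mine for illustration, not from the paper):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap CI for mean accuracy over per-item 0/1 scores."""
    correct = np.asarray(correct, dtype=float)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    means = correct[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

def paired_bootstrap_diff(a, b, n_boot=10_000, alpha=0.05):
    """CI for the accuracy gap between two models scored on the SAME items.
    Resampling items jointly keeps the per-item pairing, which typically
    yields tighter intervals than comparing two marginal CIs."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), (lo, hi)  # CI excluding 0 ~ significant difference

def items_needed(p1, p2, alpha=0.05, power=0.8):
    """Two-proportion sample size (normal approximation): roughly how many
    items a benchmark needs to detect an accuracy gap of |p1 - p2|."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    pbar = (p1 + p2) / 2
    num = (z_a * np.sqrt(2 * pbar * (1 - pbar))
           + z_b * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(np.ceil(num / (p1 - p2) ** 2))

# Made-up per-item 0/1 scores for two models on a 200-item benchmark.
model_a = rng.integers(0, 2, 200)
model_b = rng.integers(0, 2, 200)
print(bootstrap_ci(model_a))
print(paired_bootstrap_diff(model_a, model_b))
print(items_needed(0.70, 0.75))  # a 5-point gap needs ~1250 items per model
```

None of this is exotic: a few thousand resamples and a z-table give you error bars that would make “significant improvement” actually mean something.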

I would strongly recommend taking these blog posts with a grain of salt, as there is very little that can be learned without proper evaluations.


Replies

ooloncoloophid · today at 7:33 PM

The point about confidence intervals is a good one and I'd like to see it more often.