This might be inherent to how the models are benchmarked.
Don’t some benchmarks give the model multiple shots at a problem and keep only a successful result if one appears, ignoring the failure rate?
Good point. The metrics should report the mean success rate, the “any 1 of 10” rate (often called pass@10), and the “all 10 of 10” rate; that last one is what lets us estimate reliability.
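The three metrics above are easy to compute from per-problem trial outcomes. A minimal sketch, using made-up example data (the problem names and results are illustrative, not from any real benchmark):

```python
# Hypothetical per-problem results: 10 trials each, True = success.
results = {
    "problem_a": [True, True, False, True, True, True, True, False, True, True],
    "problem_b": [False, False, True, False, False, False, False, False, False, False],
    "problem_c": [True] * 10,
}

def mean_rate(trials):
    # Average per-trial success rate for one problem (expected pass@1).
    return sum(trials) / len(trials)

def any_of(trials):
    # "Any 1 of 10": did at least one trial succeed?
    return any(trials)

def all_of(trials):
    # "All 10 of 10": did every trial succeed? A proxy for reliability.
    return all(trials)

n = len(results)
print("mean success rate:", sum(mean_rate(t) for t in results.values()) / n)
print("any-of-10 rate:  ", sum(any_of(t) for t in results.values()) / n)
print("all-of-10 rate:  ", sum(all_of(t) for t in results.values()) / n)
```

On this toy data the three numbers diverge sharply (any-of-10 is 1.0 while all-of-10 is 1/3), which is exactly why reporting only the "any success" figure hides the failure rate.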