enlyth • last Thursday at 7:22 PM • 2 replies

This looks cherry-picked. For example, Claude Opus had a higher score on SWE-Bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.

Replies

tobias2014 • last Friday at 1:41 AM

And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.
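
To put a rough number on that, here is a back-of-the-envelope sketch of a two-proportion z-test on the two scores. The benchmark size `N = 500` is purely an assumption for illustration (the thread does not say how many questions are behind these percentages), and the helper name `two_proportion_z` is made up here. It also treats the two runs as independent samples, which is not quite right when both models answer the same questions, so take it as an order-of-magnitude check only.

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Return (z, se) for the difference p2 - p1 of two independent
    proportions, using the unpooled binomial standard error."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p2 - p1) / se, se

# ASSUMPTION: 500 questions per run; the real benchmark size is not given.
N = 500
z, se = two_proportion_z(0.919, 0.924, N, N)

# Half-width of an approximate 95% confidence interval on the difference.
margin_95 = 1.96 * se

print(f"difference: 0.5 points, 95% margin: +/-{margin_95 * 100:.1f} points, z = {z:.2f}")
```

Under that assumed size, the 95% margin on the difference comes out to several percentage points, far wider than the 0.5-point gap, and z is well below the usual 1.96 threshold. A much larger benchmark would shrink the margin, so the conclusion depends on the real question count.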

minadotcom • last Thursday at 9:24 PM

Agreed.
