Hacker News

jkelleyrtp, yesterday at 6:27 PM

Claude's SWE-bench score is 80.8 and Codex's is 56.8.

Seems like 4.6 is still all-around better?


Replies

gizmodo59, yesterday at 6:28 PM

It's SWE-bench Pro, not SWE-bench Verified. The Verified benchmark has stagnated.

Rudybega, yesterday at 9:59 PM

You're comparing two different benchmarks. Pro vs Verified.