Claude's SWE-bench score is 80.8 and Codex's is 56.8
Seems like 4.6 is still all-around better?
It's SWE-bench Pro, not SWE-bench Verified. The Verified benchmark has stagnated.
You're comparing two different benchmarks. Pro vs Verified.