SpicyLemonZest • yesterday at 2:38 PM • 1 reply

That's just not accurate. I haven't studied SWE Bench Pro in detail, so I can't tell you exactly what the flaw is, but SOTA models routinely make bad architectural choices I have to intervene to fix.

Replies

threepts • yesterday at 3:31 PM

You can read the paper here: https://labs.scale.com/papers/swe_bench_pro

TL;DR it's very effective, as it directly tests models on REAL codebases: "The benchmark is constructed from GPL-style copyleft repositories and private proprietary codebases". The use case is very real.
