threepts • yesterday at 3:31 PM

You can read the paper here: https://labs.scale.com/papers/swe_bench_pro

TL;DR: it's very effective because it directly tests models on REAL codebases: "The benchmark is constructed from GPL-style copyleft repositories and private proprietary codebases". The use case is very real.

Replies

SpicyLemonZest • yesterday at 3:52 PM

It doesn't sound to me like this benchmark is attempting to measure architecture design. As far as I see in the paper, they do not evaluate the architectural quality of a task completion, only whether the model is capable of completing it at all.
