mordae • yesterday at 11:07 PM

This is a terrible benchmark. It literally tests the models on their ability to track shifting line numbers. If they cannot keep up, no amount of abstract reasoning can redeem them.

Replies

lordmauve • today at 6:54 AM

Where did you get that idea? It uses mini-swe-agent, same as SWE-Bench.

https://github.com/datacurve-ai/deep-swe
