leerob • today at 5:06 PM

(I work at Cursor) We score well on Terminal-Bench and SWE-bench Multilingual. We're not as strong on DeepSWE yet, since it's geared toward very long-horizon tasks. We're planning to include more public benchmarks in our next model release.
