easygenes, today at 7:53 AM

Have you run it through DeepSWE? I understand that's probably a big ask for this class of model, but it would be interesting to see regardless.

Even if it can't fully pass many scenarios, most of them have so many tests that you get a fairly rich report beyond the pass@1 stat. See, e.g., this DeepSWE report for the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/