stared, today at 7:17 AM

SWE-bench Verified is, at this point, contaminated: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

So it is hard to tell how much of a model's gain is due to skill and how much to overfitting.