Hacker News

stingraycharles, today at 11:03 AM

The problem with these types of benchmarks is that it's all but certain the LLM has already been trained on all that code, so the results are tainted: you can't tell whether you're benchmarking recall or actual reasoning.

Same with SWE-bench and others.