stingraycharles • today at 11:03 AM

The problem with these types of benchmarks is that it’s 100% certain the LLM has already been trained on all that code, so the results are tainted: you can’t tell whether you’re benchmarking recall or actual reasoning.

Same with SWE-bench and others.
