jfim, today at 5:53 PM

I wonder how they're planning for the benchmark to stay relevant over time.

If the benchmark is to implement features that are part of an open source project, and LLMs have those changes as part of their training dataset, it seems that they could just give a verbatim or slightly modified version of the change in their training data.

And if one updates the benchmark to only incorporate code changes that are past the models' knowledge cutoff, then the benchmark is less comparable over time, since the changes in the benchmark at time T and T+1 aren't the same.