Hacker News

stanfordkid · today at 1:54 AM

I don't find this paper very compelling. Obviously it would be fraud if the generated code simply escaped the harness rather than solving the actual problem. I agree that models could in principle learn to do that, and it is important to highlight, but my sense is that the entities reporting the benchmark scores have an obligation to detect this behavior and reconsider the metrics they report. It is a bit like saying it's possible to cheat in football because the balls are deflatable. It matters, and some have done it, but it doesn't mean widespread cheating is taking place. The paper takes the tone that a lot of cheating is already happening, which I do not think is the case.