Hacker News

wongarsu, last Sunday at 12:27 PM

> Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench

That seems like a major oversight. "AI does whatever maximizes reward/minimizes loss, not what you actually want" has been one of the biggest challenges in ML over the last two decades (relevant here because researchers selecting architectures and training regimens that maximize scores on public benchmarks are effectively just a bigger training loop with those benchmarks as the reward function). And the analogous issue post-training in AGI-like systems is well studied as the alignment problem, the core issue of classical AI safety.

If cheating the benchmark is easier than passing it, you expect the cheating strategy to emerge and win. (Just like you would with humans btw)
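To make that concrete, here is a toy sketch in Python. Everything in it is made up for illustration (the `Strategy` fields, the effort numbers, the strategy names); it is not how SWE-bench or any real training loop works. It only shows that an optimizer which sees the benchmark score, and nothing else, settles on the cheapest strategy that maxes the score, whether or not that strategy solves the task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    effort: int             # hypothetical cost of discovering this strategy
    solves_task: bool       # what we actually want; invisible to the optimizer
    benchmark_score: float  # the only signal the optimizer sees


def select(strategies, budget):
    """Return the best reachable strategy as judged by the benchmark alone.

    Ties on score are broken in favor of lower effort. Note that
    `solves_task` is never consulted.
    """
    reachable = [s for s in strategies if s.effort <= budget]
    return max(reachable, key=lambda s: (s.benchmark_score, -s.effort))


# Illustrative numbers only: the exploit is cheaper than the honest fix.
STRATEGIES = [
    Strategy("honest_fix", effort=10, solves_task=True, benchmark_score=1.0),
    Strategy("exploit_harness", effort=3, solves_task=False, benchmark_score=1.0),
    Strategy("do_nothing", effort=0, solves_task=False, benchmark_score=0.0),
]

# Which strategy wins at increasing search budgets?
winners = {budget: select(STRATEGIES, budget).name for budget in (0, 5, 20)}

# Same setup, but with the exploit made harder to find than the honest fix.
HARDENED = [
    s if s.name != "exploit_harness"
    else Strategy(s.name, effort=30, solves_task=False, benchmark_score=1.0)
    for s in STRATEGIES
]
hardened_winner = select(HARDENED, 20).name

if __name__ == "__main__":
    print(winners)
    print(hardened_winner)
```

In this toy setup the honest fix only wins once the exploit costs more to find than the real solution, which is the whole argument for hardening the benchmark harness.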