>hopefully changes the way benchmarking is done.
Yeah the path forward is simple: check if the solutions actually contain solutions. If they contain exploits then that entire result is discarded.
Could it really be that not only we vibeslop all apps nowadays but also don't care to even check how ai solved a benchmark it claimed solved?
Also, fuzz your benchmarks
But that requires me to do things :(
[dead]
solution is simple:
if bug { dont }
/s
In human multiple choice tests they sometimes use negative marking to discourage guessing. It feels like exploits should cancel out several correct solutions.