Hacker News

operatingthetan | last Saturday at 7:55 PM | 6 replies

>hopefully changes the way benchmarking is done.

Yeah, the path forward is simple: check whether the submitted solutions actually solve the task. If they contain exploits, the entire result is discarded.
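
A minimal sketch of that policy, assuming "entire result" means the whole benchmark run rather than a single task. The `TaskResult` fields and the `exploit_found` flag are hypothetical names for illustration, not part of any real benchmark harness; how an exploit gets detected (manual review, automated audit) is left open.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    passed: bool         # did the harness report a pass?
    exploit_found: bool  # hypothetical flag set by a separate audit step


def score_run(results: list[TaskResult]) -> float | None:
    """Return the pass rate, or None if any task was 'solved' via an exploit."""
    if any(r.exploit_found for r in results):
        return None  # assumption: one exploit discards the whole run
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)
```

Returning `None` instead of `0.0` keeps a discarded run distinguishable from a run that honestly failed every task.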


Replies

ZeroGravitas | last Saturday at 8:24 PM

In human multiple choice tests they sometimes use negative marking to discourage guessing. It feels like exploits should cancel out several correct solutions.
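
A rough sketch of what negative marking could look like for a benchmark score. The penalty of 3 correct solutions per exploit is an arbitrary assumption chosen for illustration, and `penalized_score` is a hypothetical helper; `correct` is assumed to count only genuine solutions.

```python
def penalized_score(correct: int, exploits: int, total: int, penalty: float = 3.0) -> float:
    """Score in which each detected exploit cancels `penalty` correct solutions."""
    if total <= 0:
        raise ValueError("total must be positive")
    # The score can go negative when exploits outweigh genuine solutions.
    return (correct - penalty * exploits) / total
```

For example, 80 genuine solutions and 5 exploits out of 100 tasks would score (80 - 15) / 100 = 0.65 under this assumed penalty, compared with 0.80 if exploits were simply ignored.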

siva7 | last Saturday at 8:07 PM

Could it really be that we not only vibeslop all our apps nowadays, but also don't even bother to check how the AI solved a benchmark it claimed to have solved?

Leynos | last Saturday at 8:02 PM

Also, fuzz your benchmarks

nananana9 | last Sunday at 7:57 AM

But that requires me to do things :(

Aperocky | last Saturday at 11:49 PM

solution is simple:

if bug { dont }

/s