Reminds me of the recent Terminal Bench controversy [1][2][3] If theres a benchmark, people will c...

bitlad • today at 9:54 AM • 1 reply • view on HN

Reminds me of the recent Terminal Bench controversy [1][2][3]

If theres a benchmark, people will cheat, lie and optimize for that benchmark. Honest depends on the compliance enforced on teams. But if, compliance itself is weak, it is going to be taken advantage of. Like growing up india, you would optimize for the exam and not what you learn from it.

[1] https://news.ycombinator.com/item?id=47920787

[2] https://www.tbench.ai/news/leaderboard-integrity-update

[3] https://debugml.github.io/cheating-agents/

Replies

puzpuzpuz-hn • today at 1:31 PM

Exactly! The task gets even trickier when you're benchmarking lots of systems of different kinds: cloud databases, self-hosted ones, embedded engines, CLI tools.

alt Hacker News

Replies