It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.
Is there a leaderboard out there comparing harness results using the same models?
The most cited is Terminal-Bench 2.0, but it's also plagued by cheating accusations and benchmaxxing.
Somewhat remarkably, Claude Code ranks last for Opus 4.6, which may say something about CC, or something about the benchmark.
Maybe the future isn't a human-like centralized intelligence but an octopus-like decentralized intelligence where more focus is placed on making the harness itself "smart"
Isn't that what terminal-bench does?
I really wish there were! I even thought of creating one, but it would be a conflict of interest.
For my local tests over the past few months on the same local model, I've found Claude Code to be way better than OpenCode, and OpenCode to be better than Codex.
We probably want to compare the cartesian product of model+harness.
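A minimal sketch of what that Cartesian-product evaluation could look like. Everything here is hypothetical: the model names, harness names, and scores are placeholders, and `run_benchmark` stands in for whatever actually drives each harness against a benchmark suite.

```python
from itertools import product

def run_benchmark(model: str, harness: str) -> float:
    """Hypothetical runner: in practice this would invoke each
    harness (e.g. via its CLI) on a benchmark suite with the given
    model and return a pass rate. Scores below are made up."""
    fake_scores = {
        ("model-a", "harness-x"): 0.48,
        ("model-a", "harness-y"): 0.65,
        ("model-b", "harness-x"): 0.52,
        ("model-b", "harness-y"): 0.58,
    }
    return fake_scores[(model, harness)]

models = ["model-a", "model-b"]
harnesses = ["harness-x", "harness-y"]

# Evaluate every (model, harness) pair, so harness effects are
# visible per model instead of one leaderboard column per model.
results = {
    (m, h): run_benchmark(m, h) for m, h in product(models, harnesses)
}

for (m, h), score in sorted(results.items()):
    print(f"{m:10s} {h:10s} {score:.0%}")
```

With the full matrix you can look at spread within a row (how much the harness matters for a fixed model) as well as within a column (how much the model matters for a fixed harness).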