Hacker News

mdasen · yesterday at 2:31 PM

It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.

Is there a leaderboard out there comparing harness results using the same models?


Replies

manx · yesterday at 3:42 PM

We probably want to compare the full Cartesian product of model × harness.
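Something like the sketch below (made-up model/harness names, and run_benchmark is only a stub standing in for an actual eval run), just to make the grid concrete:

    from itertools import product
    import random

    # Hypothetical names, purely for illustration.
    models = ["model-a", "model-b", "model-c"]
    harnesses = ["harness-x", "harness-y"]

    def run_benchmark(model, harness):
        # Stub: a real grid would drive `harness` with `model`
        # against a fixed task suite and return its pass rate.
        return random.random()

    # Score every (model, harness) pair, not just each vendor's
    # default pairing.
    grid = {(m, h): run_benchmark(m, h)
            for m, h in product(models, harnesses)}

    for (m, h), score in sorted(grid.items()):
        print(f"{m} + {h}: {score:.0%}")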

nikcub · yesterday at 10:01 PM

The most cited is Terminal-Bench 2.0, but it's also plagued by cheating accusations and benchmaxxing.

Somewhat remarkably, Claude Code ranks last for Opus 4.6, which may say something about Claude Code, or say something about the benchmark.

[0] https://www.tbench.ai/leaderboard/terminal-bench/2.0

culi · yesterday at 5:37 PM

Maybe the future isn't a human-like centralized intelligence but an octopus-like decentralized intelligence, where more focus is placed on making the harness itself "smart".

isege · yesterday at 8:44 PM

Isn't that what terminal-bench does?

GodelNumbering · yesterday at 3:13 PM

I really wish there were! I even thought of creating one, but it would be a conflict of interest.

alfiedotwtf · today at 6:52 AM

For my local tests over the past few months on the same local model, I've found Claude Code to be way better than OpenCode, and OpenCode better than Codex.