Hacker News

anentropic, today at 10:11 AM

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about TerminalBench, but on the face of it, a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks.


Replies

networked, today at 12:38 PM

51% doesn't tell you much by itself. Benchmarks like this are usually not graded on a curve and aren't calibrated so that 100% is the performance level of a qualified human. You could design a superhuman benchmark where 10% was the human level of performance.

Looking at https://www.tbench.ai/leaderboard/terminal-bench/2.0, I see that the current best score is 75%, meaning 51% is roughly ⅔ of SOTA.
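
To make the arithmetic explicit, here is a minimal sketch that expresses a score as a fraction of the best leaderboard score. The function name `fraction_of_sota` is hypothetical, and the inputs are just the two figures quoted in this thread (51% and 75%), not values pulled from the leaderboard programmatically.

```python
def fraction_of_sota(score: float, best: float) -> float:
    """Return `score` as a fraction of the best known score on the same benchmark."""
    if best <= 0:
        raise ValueError("best score must be positive")
    return score / best


# Figures quoted in the thread: 51.0% for the model, 75% for the current best.
ratio = fraction_of_sota(51.0, 75.0)
print(f"{ratio:.2f}")  # 0.68
```

A ratio of 0.68 is slightly above ⅔. It only says how the score compares with the current leader, not how it compares with a qualified human.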

pitched, today at 11:59 AM

That score is on par with Gemini 3 Flash, but from scrolling through the results, these scores look much more affected by the agent used than by the model.

YetAnotherNick, today at 12:51 PM

TerminalBench is like the worst-named benchmark. It has almost nothing to do with the terminal; it's mostly about the syntax of random tools. Also, most tasks aren't agentic if the model has memorized some random tool's command-line flags.
