YetAnotherNick • yesterday at 12:51 PM

TerminalBench might be the worst-named benchmark. It has almost nothing to do with the terminal itself; it mostly tests the syntax of random tools. Also, most tasks aren't really agentic if the model has simply memorized some random tool's command-line flags.

Replies

esafak • yesterday at 2:58 PM

What do you mean? It tests whether the model knows the tools and uses them.
