logoalt Hacker News

Oxlamarrtoday at 11:00 AM0 repliesview on HN

The harness point is probably the most important part here. With terminal agents, it often feels like the benchmark is measuring the whole loop: model, prompt, tool interface, retry policy, timeout handling, and recovery from bad shell commands.

Do you have a sense of which part contributed most to the jump?