jwpapi • yesterday at 3:27 PM • 1 reply

Arguably, the last year was mainly inner harness improvement rather than model improvement. I could be wrong; it just feels that way to me.

Replies

SatvikBeri • yesterday at 5:32 PM

We can measure this by holding the harness fixed and comparing different models on it, e.g. the very plain Terminus harness on the Terminal-Bench 2.0 leaderboard: https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

Models have improved dramatically even with the same harness.
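For anyone who wants to run that comparison themselves, here is a rough sketch of the idea: hold one axis fixed (harness or model) and look at how the score changes along the other. The rows, names, and scores below are hypothetical placeholders, not real leaderboard numbers, and `paired_gains` is a helper made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (harness, model, score in percentage points).
# Placeholder values only; substitute real leaderboard entries.
rows = [
    ("plain-harness", "model-old", 20),
    ("plain-harness", "model-new", 45),
    ("fancy-harness", "model-old", 30),
    ("fancy-harness", "model-new", 55),
]

def paired_gains(rows, hold, old, new):
    """Score difference (new - old) on one axis while the other is held fixed.

    hold="harness": compare two models within each harness.
    hold="model":   compare two harnesses within each model.
    """
    fixed_idx, varied_idx = (0, 1) if hold == "harness" else (1, 0)
    table = defaultdict(dict)
    for row in rows:
        table[row[fixed_idx]][row[varied_idx]] = row[2]
    # Only keep groups where both sides of the comparison were measured.
    return {
        key: scores[new] - scores[old]
        for key, scores in table.items()
        if old in scores and new in scores
    }

# Model effect with the harness held fixed.
model_effect = paired_gains(rows, "harness", "model-old", "model-new")
# Harness effect with the model held fixed.
harness_effect = paired_gains(rows, "model", "plain-harness", "fancy-harness")
print(model_effect, harness_effect)
```

If the gains with the harness held fixed are much larger than the gains with the model held fixed, that points to model improvement as the main driver, and vice versa.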
