Terminal Bench tests agent harnesses.
The top two on it are Codex and Forge Code.
However, I use plugins and skills that are either only compatible with Claude Code or work best with it.
So, for me, Claude Code with plugins like claude-meme, Context Mode, Superpowers and Get Shit Done is better than other tools.
I think everyone should test multiple models and multiple agent harnesses against their own specific needs, codebase, and way of working.