Is Claude Code the best coding harness? Is anyone running evals on that?
Terminal Bench tests agent harnesses.
The top two there are Codex and Forge Code.
However, I am using plugins and skills that are either only compatible with Claude Code or work best with it.
So, for me, Claude Code with plugins like claude-meme, Context Mode, Superpowers and Get Shit Done is better than other tools.
I think everyone should test multiple models and multiple agent harnesses against their own specific needs, codebase, and way of working.
In my anecdotal experience, it is not. The same model, Opus, works better in third-party harnesses such as Factory Droid or Amp.
Claude Code, on the other hand, is the most subsidized one, both for consumers (through the Max subscription) and for enterprises (through token discounts). It is also heavily optimized for cost, especially through token caching and reduced thinking, at the expense of quality.