Terminal Bench 2.0
| Name | Score |
|---------------------|-------|
| OpenAI Codex 5.3 | 77.3 |
| Anthropic Opus 4.6 | 65.4 |

Benchmarks are useless compared to real world performance.
Real world performance for these models is a disappointment.
yea but i feel like we are over the hill on benchmaxxing. many times a model has beaten anthropic on a specific bench, but the 'feel' is that it is still not as good at coding