Hacker News

tosh, yesterday at 6:46 PM

Terminal Bench 2.0

  | Name                | Score |
  |---------------------|-------|
  | OpenAI Codex 5.3    | 77.3  |
  | Anthropic Opus 4.6  | 65.4  |

Replies

greenfish6, yesterday at 6:47 PM

Yeah, but I feel like we are over the hill on benchmaxxing. Many times a model has beaten Anthropic on a specific benchmark, but the 'feel' is that it is still not as good at coding.

xyst, yesterday at 9:03 PM

Benchmarks are useless compared to real world performance.

Real-world performance for these models is a disappointment.