the most cited is Terminal-Bench 2.0 [0], but it's also plagued by cheating accusations and benchmaxxing.
somewhat remarkably, Claude Code ranks last among the agents running Opus 4.6 - which may say something about cc, or something about the benchmark.
[0] https://www.tbench.ai/leaderboard/terminal-bench/2.0