Hacker News

modeless · yesterday at 6:34 PM · 3 replies

It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so the numbers for them are not comparable.


Replies

alexhans · yesterday at 6:56 PM

Isn't the best eval the one you build yourself, for your own use cases and value production?

I encourage people to try. You can even timebox it and start with a few simple checks that might initially look insufficient; that discomfort is actually a sign there's something there. It's very similar to going from having no unit/integration tests (for design or regression) to having them.
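To make that concrete, here's a minimal sketch of what a personal eval harness can look like (Python). Everything in it is illustrative: run_model is a stub you'd replace with your actual API client, and the case names and checks are placeholders, not taken from any real benchmark.

    # Minimal personal eval harness: prompt/check pairs run against whatever
    # model you care about. Swap run_model for your own client call.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EvalCase:
        name: str
        prompt: str
        check: Callable[[str], bool]  # True if the model output passes

    def run_model(prompt: str) -> str:
        # Placeholder: replace with a call to the model/API you actually use.
        # Echoing the prompt lets the harness run end-to-end without a key.
        return prompt

    CASES = [
        EvalCase(
            name="sql_join",
            prompt="Write a SQL query joining orders to customers on customer_id.",
            check=lambda out: "JOIN" in out.upper() and "customer_id" in out,
        ),
        EvalCase(
            name="regex_email",
            prompt="Give a Python regex that matches a simple email address.",
            check=lambda out: "@" in out,
        ),
    ]

    def run_suite() -> None:
        passed = 0
        for case in CASES:
            ok = case.check(run_model(case.prompt))
            passed += int(ok)
            print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
        print(f"{passed}/{len(CASES)} passed")

    if __name__ == "__main__":
        run_suite()

The point isn't the specific checks; it's that a timeboxed script like this, run against each new model on your own tasks, tells you more than a leaderboard delta ever will.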

rsanek · yesterday at 6:51 PM

I usually wait to see what ArtificialAnalysis says for a direct comparison.

input_sh · yesterday at 6:39 PM

It's better on a benchmark I've never heard of!? That is groundbreaking, I'm switching immediately!
