Hacker News

modeless · yesterday at 6:34 PM · 3 replies

It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so the numbers for them are not comparable.


Replies

alexhans · yesterday at 6:56 PM

Isn't the best eval the one you build yourself, for your own use cases and value production?

I encourage people to try. You can even timebox it and start with a few simple checks that might initially look insufficient; that discomfort is actually a sign there's something there. It's very similar to going from having no unit/integration tests (for design or regression) to having them.
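To make that concrete, here's a minimal sketch of what a personal eval harness can look like (Python). Everything in it is illustrative: run_model is a stub you'd replace with your actual API client, and the case names and checks are placeholders, not taken from any real benchmark.

    # Minimal personal eval harness: prompt/check pairs run against whatever
    # model you care about. Swap run_model for your own client call.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EvalCase:
        name: str
        prompt: str
        check: Callable[[str], bool]  # True if the model output passes

    def run_model(prompt: str) -> str:
        # Placeholder: replace with a call to the model/API you actually use.
        # Echoing the prompt lets the harness run end-to-end without a key.
        return prompt

    CASES = [
        EvalCase(
            name="sql_join",
            prompt="Write a SQL query joining orders to customers on customer_id.",
            check=lambda out: "JOIN" in out.upper() and "customer_id" in out,
        ),
        EvalCase(
            name="regex_email",
            prompt="Give a Python regex that matches a simple email address.",
            check=lambda out: "@" in out,
        ),
    ]

    def run_suite() -> None:
        passed = 0
        for case in CASES:
            ok = case.check(run_model(case.prompt))
            passed += int(ok)
            print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
        print(f"{passed}/{len(CASES)} passed")

    if __name__ == "__main__":
        run_suite()

The point isn't the specific checks; it's that a timeboxed script like this, run against each new model on your own tasks, tells you more than a leaderboard delta ever will.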

rsanek · yesterday at 6:51 PM

I usually wait to see what ArtificialAnalysis says for a direct comparison.

input_sh · yesterday at 6:39 PM

It's better on a benchmark I've never heard of!? That is groundbreaking, I'm switching immediately!
