Hacker News

granzymes · yesterday at 6:12 PM · 6 replies

I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!

The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 for GPT-5.2-codex.

GPT-5.3-codex scores 77.3.


Replies

the_duke · yesterday at 6:22 PM

I do not trust the AI benchmarks much; they often do not line up with my experience.

That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.

So very much looking forward to trying out 5.3.

leumon · yesterday at 7:19 PM

They tested it at xhigh reasoning, though, which is probably double the cost of Anthropic's model.

Cost to Run Artificial Analysis Intelligence Index:

GPT-5.2 Codex (xhigh): $3244

Claude Opus 4.5-reasoning: $1485

(and probably similar values for the newer models?)

__jl__ · yesterday at 6:16 PM

Impressive jump for GPT-5.3-codex and crazy to see two top coding models come out on the same day...

wilg · yesterday at 7:31 PM

In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding; I'm baffled why people think Claude has the edge on programming.

jronak · yesterday at 7:34 PM

Did you look at ARC-AGI-2? Codex might be overfit for Terminal-Bench.

nurettin · yesterday at 6:38 PM

Opus was quite useless today. It created lots of globals, statics, forward declarations, and hidden implementations in .cpp files with no testable interface, erased types, and cast void pointers. I had to fix quite a lot and decouple the entangled mess.

Hopefully performance will pick up after the rollout.
