Hacker News

gizmodo59, yesterday at 6:14 PM

5.3 Codex (https://openai.com/index/introducing-gpt-5-3-codex/) crushes it with 77.3% on Terminal Bench. That makes for the shortest-lived lead yet: less than 35 minutes. What a time to be alive!


Replies

wasmainiac, yesterday at 7:10 PM

Dumb question: can these benchmarks be trusted when model performance tends to vary with the time of day and the load on OpenAI's servers? How do I know I'm not getting a severe penalty for chatting at the wrong time? Or are the models at their best right after launch, then slowly dialed down to more economical settings once the hype wears off?

purplerabbit, yesterday at 6:29 PM

The lack of broad benchmark reporting in this announcement makes me curious: has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we've all tried both of these out.

nharada, yesterday at 6:23 PM

That's a massive jump. I'm curious whether it feels materially different in how it works, or whether we're starting to reach the point of benchmark saturation. If the benchmark is good, then 10 points should be a big improvement in capability...

jkelleyrtp, yesterday at 6:27 PM

Claude's SWE-bench score is 80.8 and Codex's is 56.8.

Seems like 4.6 is still all-around better?
