I do not trust the AI benchmarks much; they often do not line up with my experience.
That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
So very much looking forward to trying out 5.3.
5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double-check the work. It's nice to get another frontier model's opinion and have it spot any potential issues.
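For anyone curious, that double-check step is roughly the sketch below. It assumes the Anthropic Python SDK with an API key in the environment; the model name, prompt wording, and the git diff source are all placeholders for whatever you actually use, not a fixed recipe.

    # Minimal sketch of the "second opinion" step: send the finished diff to a
    # different vendor's model and ask for a critical review.
    # Assumes the Anthropic Python SDK and ANTHROPIC_API_KEY in the environment;
    # the model name and prompt are placeholders.
    import subprocess
    import anthropic

    diff = subprocess.run(["git", "diff", "main"], capture_output=True, text=True).stdout

    client = anthropic.Anthropic()
    review = client.messages.create(
        model="claude-opus-4-5",  # placeholder for whatever Opus snapshot you use
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Another model wrote this change. Double-check it and list any "
                       "potential bugs, missed edge cases, or risky assumptions:\n\n" + diff,
        }],
    )
    print(review.content[0].text)

Feeding it only the diff, rather than the first model's reasoning, keeps the second opinion independent.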
Looking forward to trying 5.3.
Yeah, these benchmarks are bogus.
Every new model overfits to the latest overhyped benchmark.
Someone should take this to its logical extreme and train a tiny model that outscores the frontier models on one specific benchmark.
The ARC-AGI-2 leaderboard correlates strongly with my experience using these models for Rust/CUDA coding.
Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because several "99% of the time, assumption X is correct" defaults are reversed in my project. I think Opus does better at not falling into those traps. Excited to try out 5.3.
Another day, another HN thread of "this model changes everything", followed immediately by a reply saying "actually I have the literal opposite experience and find the competitor's model is the best", repeated until it's time to start the next day's thread.
Just some anecdata++ here, but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by Codex, and then re-prompt with the findings from the review. It finds real issues, doesn't flag nits (if prompted not to), and the overall flow is worth it for me. The speed loss doesn't impact this flow much.
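In case it helps, here's the rough shape of that loop, as a sketch only: it uses the plain OpenAI Python SDK rather than the codex CLI, the model names are placeholders for "cheaper drafting model" and "stronger reviewer", and a single review pass stands in for however many rounds you actually do.

    # Sketch of the crunch -> review -> re-prompt loop described above.
    # Assumes the OpenAI Python SDK and OPENAI_API_KEY; the model names are
    # placeholders, not exact product names.
    from openai import OpenAI

    client = OpenAI()

    def ask(model: str, prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    task = "Write a function that parses RFC 3339 timestamps into UTC datetimes."

    # 1. Cheaper model produces the first draft.
    draft = ask("gpt-5-mini", task)

    # 2. Stronger model reviews it, explicitly told to skip nits.
    review = ask(
        "gpt-5.2-codex",
        "Review this code for correctness and missed edge cases only. "
        "Do not flag style nits.\n\n" + draft,
    )

    # 3. Re-prompt the cheaper model with the review findings.
    revised = ask(
        "gpt-5-mini",
        f"Task: {task}\n\nYour draft:\n{draft}\n\nReview findings:\n{review}\n\n"
        "Revise the code to address the findings.",
    )
    print(revised)

Since the review runs async anyway, the slower model's latency mostly disappears into the gap before the re-prompt.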