I do not trust the AI benchmarks much; they often do not line up with my experience.
That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
So very much looking forward to trying out 5.3.
5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double-check the work. It's nice to get another frontier model's opinion and have it spot any potential issues.
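For anyone curious, that double-check step is roughly the sketch below. It assumes the Anthropic Python SDK with an API key in the environment; the model name, prompt wording, and the git diff source are all placeholders for whatever you actually use, not a fixed recipe.

    # Minimal sketch of the "second opinion" step: send the finished diff to a
    # different vendor's model and ask for a critical review.
    # Assumes the Anthropic Python SDK and ANTHROPIC_API_KEY in the environment;
    # the model name and prompt are placeholders.
    import subprocess
    import anthropic

    diff = subprocess.run(["git", "diff", "main"], capture_output=True, text=True).stdout

    client = anthropic.Anthropic()
    review = client.messages.create(
        model="claude-opus-4-5",  # placeholder for whatever Opus snapshot you use
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Another model wrote this change. Double-check it and list any "
                       "potential bugs, missed edge cases, or risky assumptions:\n\n" + diff,
        }],
    )
    print(review.content[0].text)

Feeding it only the diff, rather than the first model's reasoning, keeps the second opinion independent.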
Looking forward to trying 5.3.
Yeah, these benchmarks are bogus.
Every new model overfits to the latest overhyped benchmark.
Someone should take this to its logical extreme and train a tiny model that outscores the frontier models on one specific benchmark.
The ARC-AGI-2 leaderboard correlates strongly with my experience using these models for Rust/CUDA coding.
Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because several "99% of the time, assumption X is correct" defaults are reversed in my project. I think Opus does better at not falling into those traps. Excited to try out 5.3.
Another day, another HN thread of "this model changes everything", followed immediately by a reply saying "actually I have the literal opposite experience and find the competitor's model is the best", repeated until it's time to start the next day's thread.
Just some anecdata++ here, but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by Codex, and then re-prompt with the findings from the review. It finds real issues, doesn't flag nits (if prompted not to), and the overall flow is worth it for me. The speed loss doesn't impact this flow much.
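In case it helps, here's the rough shape of that loop, as a sketch only: it uses the plain OpenAI Python SDK rather than the codex CLI, the model names are placeholders for "cheaper drafting model" and "stronger reviewer", and a single review pass stands in for however many rounds you actually do.

    # Sketch of the crunch -> review -> re-prompt loop described above.
    # Assumes the OpenAI Python SDK and OPENAI_API_KEY; the model names are
    # placeholders, not exact product names.
    from openai import OpenAI

    client = OpenAI()

    def ask(model: str, prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    task = "Write a function that parses RFC 3339 timestamps into UTC datetimes."

    # 1. Cheaper model produces the first draft.
    draft = ask("gpt-5-mini", task)

    # 2. Stronger model reviews it, explicitly told to skip nits.
    review = ask(
        "gpt-5.2-codex",
        "Review this code for correctness and missed edge cases only. "
        "Do not flag style nits.\n\n" + draft,
    )

    # 3. Re-prompt the cheaper model with the review findings.
    revised = ask(
        "gpt-5-mini",
        f"Task: {task}\n\nYour draft:\n{draft}\n\nReview findings:\n{review}\n\n"
        "Revise the code to address the findings.",
    )
    print(revised)

Since the review runs async anyway, the slower model's latency mostly disappears into the gap before the re-prompt.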