karmasimida • yesterday at 7:45 PM • 1 reply

For those who care:

GPT-5.3-Codex dominates terminal coding with a roughly 12% lead (Terminal-Bench 2.0), while Opus 4.6 retains the edge in general computer use by 8% (OSWorld).

Does anyone know the difference between OSWorld and OSWorld Verified?

Replies

nopinsight • yesterday at 8:43 PM

From Claude 4.6 Thinking:

OSWorld is the full 369-task benchmark. OSWorld Verified is a ~200-task subset where humans have confirmed the eval scripts reliably score success/failure — the full set has some noisy grading where correct actions can still get marked wrong.

Scores on Verified tend to run higher, so they're not directly comparable.
