Early benchmark results on our private complex reasoning suite:

gertlabs • yesterday at 8:31 PM • 2 replies

Early benchmark results on our private complex reasoning suite: https://gertlabs.com/?mode=agentic_coding

Opus 4.7 is more strategic, more intelligent, and has a higher intelligence floor than 4.6 or 4.5. It's roughly tied with GPT 5.4 as the frontier model for one-shot coding reasoning, and in agentic sessions with tools, it IS the best, as advertised (slightly edging out Opus 4.5, not a typo).

We're still running more evals, and it will take a few days to get enough decision-making (non-coding) simulations to finalize leaderboard positions, but I don't expect much movement on the coding sections of the leaderboard at this point.

Even Anthropic's own model card shows context handling regressions -- we're still working on adding a context-specific visualization and benchmark to the suite to give you the objective numbers there.

Replies

carbocation • yesterday at 11:48 PM

Is there a page where I could read more? What's unintuitive at a glance is that Opus 4.7 has a lower success rate than Sonnet 4.6 (90% vs 100%) while having a higher Avg Percentile (87.2% vs 70.9%).

OsrsNeedsf2P • yesterday at 10:09 PM

Do your benchmark results indicate any level of regression on Opus 4.6 or 4.5 since their first release?
