People complain about a lot of things. Claude has been fine:

solenoid0937 • yesterday at 4:06 PM • 5 replies • view on HN

https://marginlab.ai/trackers/claude-code-historical-perform...

Replies

I will be the first to acknowledge that humans are a bad judge of performance and that some of the allegations are likely just hallucinations...

But... Are you really going to completely rely on benchmarks that have time and time again be shown to be gamed as the complete story?

My take: It is pretty clear that the capacity crunch is real and the changes they made to effort are in part to reduce that. It likely changed the experience for users.

Majromax • yesterday at 4:20 PM

While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval runs from 35% to 65% pass rates, a full factor of two performance difference.

Moreover, on the companion codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked yet none correspond to a visual break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.

➕ show 1 reply

jofzar • today at 12:50 AM

Matrix also found that Claude was AB testing 4.6 vs 4.7 in production for the last 12 days.

https://matrix.dev/blog-2026-04-16

cbg0 • yesterday at 4:20 PM

That performance monitor is super easy to game if you cache responses to all the SWE bench questions.

➕ show 1 reply

sumedh • yesterday at 10:14 PM

Your link shows there have been huge drops.

How is it fine?

alt Hacker News

Replies