Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.
https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
It's hard to trust public, high profile benchmarks because any change to a specific model (Opus 4.5 in this case) can be rejected if they have regressions on SWE-Bench-Pro, so everything that gets to be released would perform well in this benchmark