Dumb question: can these benchmarks be trusted when model performance tends to vary depending on the hour and the load on OpenAI's servers? How do I know I'm not getting a severe penalty for chatting at the wrong time? Or even, are the models best right after launch and then slowly eroded toward more economical settings once the hype wears off?
It's a fair question. I'd expect the numbers are all real. Competitors will rerun the benchmark against these models to see how they respond to and succeed at the tasks, and use that information to improve their own models. If the benchmark numbers aren't real, competitors will call out that they're not reproducible.
However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.
On benchmarks GPT 5.2 was roughly equivalent to Opus 4.5, but most people who've used both for SWE work would say that Opus 4.5 is/was noticeably better.
I don't think much from OpenAI can be trusted tbh.
At the end of the day you test it on your own use cases anyway, but benchmarks are a great initial hint as to whether it's worth testing at all.
When do you think we should run this benchmark? Friday, 1pm? Monday 8AM? Wednesday 11AM?
I definitely suspect all these models are being degraded during heavy loads.
We know OpenAI already got caught obtaining benchmark data and tuning their models to it. So the answer is a hard no. I imagine that over time benchmarks give a general view of the landscape and of improvements, but take them with a large grain of salt.
We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.
(I'm from OpenAI.)