
ZeroCool2u · yesterday at 6:27 PM

Bit concerning that we see significantly worse results in some cases when thinking is enabled. Especially for math, but also on the browser agent benchmark.

Not sure whether this reflects worse on the test-time-compute paradigm or on the underlying model itself.

Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.


Replies

oersted · yesterday at 6:43 PM

I believe you are looking at GPT 5.4 Pro. The naming is confusing alongside the subscription plan names, Gemini's naming, and such, but they've had Pro versions of the GPT 5 models (and I believe of o3 and o1 too) for a while.

It's the one you get access to with the top ~$200 subscription, and it's available through the API at a MUCH higher price ($30/$180 per 1M tokens vs $2.5/$15 for 5.4), but the performance improvement is marginal.

Not sure what it is exactly; my guess is it's the non-quantized version of the model or something like that.
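To make the price gap concrete, here's a quick back-of-the-envelope script using the per-1M-token rates quoted above. The request sizes are made up for illustration, and the model names are just labels, not real API identifiers:

    # Rough cost comparison at the quoted per-1M-token rates
    # (input, output). Request sizes below are invented examples.
    PRICES = {
        "gpt-5.4": (2.5, 15.0),
        "gpt-5.4-pro": (30.0, 180.0),
    }

    def request_cost(model, input_tokens, output_tokens):
        """Dollar cost of one request at the quoted rates."""
        in_rate, out_rate = PRICES[model]
        return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

    # e.g. a 20k-token prompt with a 5k-token response:
    for model in PRICES:
        print(model, round(request_cost(model, 20_000, 5_000), 3))
    # gpt-5.4 0.125
    # gpt-5.4-pro 1.5   <- 12x more at these rates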

highfrequency · yesterday at 6:36 PM

Can you be more specific about which math results you're talking about? It looks like a significant improvement on FrontierMath, especially for the Pro model (which uses the most inference-time compute).

andoando · yesterday at 6:52 PM

The thinking models are additionally trained with reinforcement learning to produce chain-of-thought reasoning.
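For anyone curious what that looks like mechanically, here's a toy REINFORCE sketch: a tabular softmax policy samples a "thought" token and then a final answer on a one-digit addition task, and the only reward is whether the final answer is correct. Everything here (the task, the policy, the hyperparameters) is invented for illustration; real systems use an LLM policy and verifier-based rewards at vastly larger scale:

    import numpy as np

    rng = np.random.default_rng(0)
    N_THOUGHTS, N_ANSWERS, LR = 2, 19, 1.0

    # Tabular policy: per question (a, b), logits over one "thought"
    # token, then logits over the final answer given that thought.
    thought_logits = np.zeros((10, 10, N_THOUGHTS))
    answer_logits = np.zeros((10, 10, N_THOUGHTS, N_ANSWERS))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def rollout(a, b):
        p_t = softmax(thought_logits[a, b])
        t = rng.choice(N_THOUGHTS, p=p_t)
        p_ans = softmax(answer_logits[a, b, t])
        ans = rng.choice(N_ANSWERS, p=p_ans)
        return t, ans, p_t, p_ans

    for _ in range(50_000):
        a, b = rng.integers(0, 10, size=2)
        t, ans, p_t, p_ans = rollout(a, b)
        reward = 1.0 if ans == a + b else 0.0   # outcome-only reward
        # REINFORCE: grad of log softmax is (one_hot - probs); the same
        # scalar reward updates both the thought and the answer step.
        g_t = -p_t; g_t[t] += 1.0
        g_a = -p_ans; g_a[ans] += 1.0
        thought_logits[a, b] += LR * reward * g_t
        answer_logits[a, b, t] += LR * reward * g_a

    acc = sum(rollout(a, b)[1] == a + b
              for a in range(10) for b in range(10)) / 100
    print("accuracy:", acc)   # far above the ~0.05 chance baseline

The point is the credit-assignment structure: nothing supervises the intermediate token directly, yet it gets reinforced whenever it precedes a correct answer.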
