Is it just me, or do they very carefully avoid reporting performance for GPT-5.4 Pro, showing only the default GPT-5.4? They also very carefully left Anthropic models out of their comparison.
I went back to the BixBench benchmark they mentioned. I couldn't find official results for Anthropic models, but I found a project taking Opus 4.6 from 65.3% to 92.0% (which would be above GPT-Rosalind) using nearly 200 carefully crafted skills [1]. There also appear to be competitor models with scores on par with this tuned GPT.
BixBench seems like a really interesting and useful idea, but most of the value for a layperson (like me) is in comparing the results of different models on the benchmark. From what I can find, there is no centralised, regularly updated set of model results. Shame.