Hacker News

spuz today at 3:17 PM

As well as measuring how many questions each model was able to answer correctly, I think it's equally important to measure how many questions each model answered incorrectly. After all, if you consider using them as a tool, you will need to have confidence that any answer they give is correct.

If you look at Table 3, you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 across the 20 × 100 runs:

- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)

- Opus: 1306/2000 questions answered, of which 294 were correct (22%)

So while you can claim that Opus solved 40% of the problems, 78% of the answers it actually gave were wrong. That means if you chose this model to answer your homework question, there is a good chance you would fail.
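For anyone who wants to check the arithmetic, here is a rough sketch that recomputes the percentages from the counts quoted above. The function and key names are made up for illustration, and the only inputs are the four numbers and the 2000 denominator from this comment, not anything taken directly from the paper.

```python
# Counts as quoted in this comment (attributed here to Table 3).
TOTAL = 2000  # 20 x 100, the denominator used above

quoted_counts = {
    "GPT 5.5": {"answered": 1389, "correct": 1043},
    "Opus": {"answered": 1306, "correct": 294},
}


def rates(answered: int, correct: int, total: int = TOTAL) -> dict:
    """Return the rates discussed above for one model (hypothetical helper)."""
    precision = correct / answered  # share of given answers that were correct
    return {
        "attempt_rate": answered / total,      # how often an answer was given at all
        "precision": precision,
        "failure_rate": 1.0 - precision,       # share of given answers that were wrong
        "overall_accuracy": correct / total,   # correct answers over all questions
    }


for model, counts in quoted_counts.items():
    r = rates(**counts)
    print(
        f"{model}: answered {r['attempt_rate']:.1%}, "
        f"correct when answering {r['precision']:.1%}, "
        f"wrong when answering {r['failure_rate']:.1%}, "
        f"correct overall {r['overall_accuracy']:.1%}"
    )
```

The point of separating `precision` from `overall_accuracy` is that the first tells you how much to trust an answer once you have one, while the second also penalises the model for declining to answer.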

Perhaps a more useful benchmark for future models would be to measure how many of these types of questions they can answer correctly in one shot, i.e. how confident you can be when using them for real-world tasks.


Replies

christianstump today at 3:44 PM

You are 100% correct with your assessment of the situation. But I do not agree with either of your conclusions:

1. These questions cannot and must not be compared to homework questions. These are different leagues and possibly even different sports.

2. The "more useful benchmark" that you suggest is already present in the data, as we ran every model exactly once in Stage 1.