Hacker News

wolttam · today at 1:08 PM

Why would your test be including scores of failed responses/runs? That seems confusing.

(I am confused by the results your website is presenting)


Replies

XCSme · today at 1:42 PM

Because the idea of those benchmarks is to see how well a model performs in real-world scenarios: most models are used via APIs, not self-hosted.

So, for example, hypothetically, if GPT-5.5 were super intelligent but calling it via the API failed 50% of the time, then using it in real-life scenarios would make your workflows fail far more often than using a "dumber" but more stable model.
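A minimal sketch of that reasoning (hypothetical numbers and function name, not the site's actual scoring code): if failed API runs count as zero, reliability directly scales the benchmark score, so a smarter-but-flaky model can land below a dumber-but-stable one.

```python
def effective_score(quality: float, success_rate: float) -> float:
    """Average score across all runs, counting failed runs as 0."""
    return quality * success_rate

# A "smarter" model that fails half its API calls...
smart_unstable = effective_score(0.95, 0.50)  # 0.475
# ...scores below a dumber but reliable model.
dumb_stable = effective_score(0.70, 0.99)     # 0.693

print(smart_unstable < dumb_stable)  # True
```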

My plan is also to re-test models over time, which should account for infrastructure improvements and also detect model "nerfing".
