Because the idea of these benchmarks is to see how well a model performs in real-world scenarios, and most models are served via APIs rather than self-hosted.
So, hypothetically, if GPT-5.5 were super intelligent but its API failed 50% of the time, using it in real-life scenarios would make your workflows fail far more often than using a "dumber" but more stable model.
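
To see why this matters so much, here's a minimal sketch of how a per-call failure rate compounds across a multi-step workflow (the failure rates and model names are hypothetical, just for illustration):

```python
# Minimal sketch: how per-call API failure rates compound across a workflow.
# All numbers and names here are hypothetical, purely for illustration.

def workflow_success_rate(per_call_failure: float, num_calls: int) -> float:
    """Probability that every API call in an n-step workflow succeeds,
    assuming independent failures and no retries."""
    return (1 - per_call_failure) ** num_calls

# A "smart" model whose API fails 50% of the time vs. a "dumber" stable one.
for name, failure in [("smart-but-flaky", 0.50), ("dumb-but-stable", 0.02)]:
    for steps in (1, 5, 10):
        rate = workflow_success_rate(failure, steps)
        print(f"{name}: {steps}-step workflow succeeds {rate:.1%} of the time")
```

With these toy numbers, the flaky model completes a 5-step workflow only about 3% of the time, while the stable one still succeeds around 90% of the time, so raw intelligence stops mattering pretty quickly.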
My plan is also to re-test models over time, which should account for infrastructure improvements and also catch model "nerfing".