
wongarsu yesterday at 8:40 PM

I'm somehow more convinced by the method shown in the introduction of the article: run a number of evals across model providers and see how they compare. This also catches any other configuration change an inference provider might make, like KV-cache quantization. And it's easy to understand and talk about, and the threat model is fairly clear (be wary of providers hardcoding answers to your benchmark if you're really distrustful).
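
To make that concrete, here's a rough sketch of the approach: send the same prompts to each provider's OpenAI-compatible endpoint and compare scores. The provider URLs, keys, model name, and the toy substring scoring are all placeholders, not any particular site's setup:

    # Cross-provider eval sketch. Assumes OpenAI-compatible
    # /v1/chat/completions endpoints; URLs/keys/model are placeholders.
    import requests

    PROVIDERS = {
        "provider_a": ("https://api.provider-a.example/v1", "KEY_A"),
        "provider_b": ("https://api.provider-b.example/v1", "KEY_B"),
    }

    # (prompt, expected answer) pairs; a real suite would be much larger
    EVAL_SET = [
        ("What is 17 * 23?", "391"),
        ("Name the capital of Australia.", "Canberra"),
    ]

    def ask(base_url, key, prompt, model="some-open-weights-model"):
        r = requests.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {key}"},
            json={"model": model, "temperature": 0,
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

    for name, (url, key) in PROVIDERS.items():
        correct = sum(
            expected.lower() in ask(url, key, prompt).lower()
            for prompt, expected in EVAL_SET
        )
        print(f"{name}: {correct}/{len(EVAL_SET)}")

A quiet quantization change or a swapped checkpoint shows up as a score shift across the whole suite, not just on one prompt.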

Of course, attestation is conceptually neat and wastes less compute than repeated benchmarks. It definitely has its place.


Replies

Aurornis yesterday at 9:25 PM

This comes up so frequently that I’ve seen at least 3-4 different websites running daily benchmarks on providers and plotting their performance.

The last one I bookmarked has already disappeared. I think they’re generally vibe coded by developers who think they’re going to prove something, but then realize how much it costs to burn that many tokens every day.

They also use limited subsets of the big benchmarks to keep costs down, which increases the noise in the results. The last time someone linked one of these sites claiming a decline in quality, the graph was noisy and mostly flat, with a regression line drawn through it that sloped very slightly downward.
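
For a sense of how noisy that gets: accuracy on an n-question subset has a standard error of about sqrt(p(1-p)/n), so with 100 questions and a true accuracy of 80% each daily point wobbles by roughly ±4 points. A quick simulation (made-up numbers, with no real decline baked in) will still hand you a nonzero regression slope:

    # Why small subsets give noisy graphs: simulate 90 days of scores
    # on a 100-question subset with a *constant* true accuracy of 0.80,
    # then fit a trend line. All numbers are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    p, n, days = 0.80, 100, 90

    daily_acc = rng.binomial(n, p, size=days) / n
    slope, intercept = np.polyfit(np.arange(days), daily_acc, 1)

    print(f"per-day std error ~ {np.sqrt(p * (1 - p) / n):.3f}")  # ~0.040
    print(f"fitted slope: {slope:+.6f} accuracy/day")  # nonzero by luck alone

Eyeballing a regression line on that kind of data tells you nothing without an error bar on the slope.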