Hacker News

eis, today at 5:38 AM

The Elo rating system measures a model's performance relative to the other models on the list. As those models improve, or rather as newer, better models enter the list, the Elo score of an existing model will tend to decrease even though nothing about the model or its system prompt has changed.

You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
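To make that concrete, here is a rough sketch of the effect using the standard Elo update. Everything in it is an assumption for illustration: the model names, the latent strengths, the K-factor of 16, and the simplification that each match result equals the average outcome implied by the latent strengths (so the run is deterministic rather than sampled).

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def play_round(ratings, strength, k=16.0):
    """Every pair plays once; ratings are updated in place.

    Assumption: the 'actual' result is the average outcome implied by the
    hidden latent strengths, not a sampled win/loss.
    """
    names = list(ratings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            actual = expected_score(strength[a], strength[b])
            predicted = expected_score(ratings[a], ratings[b])
            delta = k * (actual - predicted)
            ratings[a] += delta
            ratings[b] -= delta  # Elo is zero-sum within a match


def simulate(rounds=200):
    # Hypothetical models; old_model's latent strength never changes.
    strength = {"old_model": 1000.0, "peer_model": 1000.0}
    ratings = {name: 1000.0 for name in strength}

    for _ in range(rounds):
        play_round(ratings, strength)
    before = ratings["old_model"]

    # A stronger model joins the list at the default starting rating.
    strength["new_model"] = 1200.0
    ratings["new_model"] = 1000.0

    for _ in range(rounds):
        play_round(ratings, strength)
    after = ratings["old_model"]

    return before, after


if __name__ == "__main__":
    before, after = simulate()
    print(f"old_model Elo before new entrant: {before:.1f}")
    print(f"old_model Elo after new entrant:  {after:.1f}")
```

In this toy setup old_model's rating falls once new_model joins, purely because the newcomer collects points from the existing pool. A fixed harness over a fixed test set would report the same score for old_model in both phases, since its strength was never touched.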


Replies

bob1029, today at 8:10 AM

The relative and auto-scaling nature of Elo ranking feels like an advantage here.

Relative ranking systems extract more information per tournament. With enough tournaments, you will get something approximating the actual latent skill level.
