The Elo rating system measures performance relative to other models. As those models improve, or as newer, stronger models enter the list, the Elo score of an existing model will tend to decrease even if nothing about the model or its system prompt has changed.
You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
The relative and auto-scaling nature of Elo ranking feels like an advantage here.
Relative ranking systems extract more information per tournament, and with enough tournaments you get something approximating each model's actual latent skill level.
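The drift is easy to see in a small simulation. Below is a minimal sketch using the standard Elo update (K=32, 400-point logistic scale); the names and the skill numbers are illustrative assumptions, not any real leaderboard's data. A model whose true skill never changes plays against successively stronger entrants, and its Elo rating falls anyway.

```python
import random

def expected(r_a, r_b):
    # Expected score of A vs. B on the standard 400-point logistic scale.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, k=32):
    # One Elo update; score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    delta = k * (score_a - expected(r_a, r_b))
    return r_a + delta, r_b - delta

random.seed(0)
frozen_skill = 1200.0   # the model's true (unchanging) latent skill
frozen_elo = 1200.0     # its rating on the leaderboard

# Stronger entrants arrive over time, each starting at the default rating.
for entrant_skill in (1300.0, 1400.0, 1500.0):
    entrant_elo = 1200.0
    for _ in range(100):
        # Outcomes are driven by the true latent skills, not the ratings.
        win = random.random() < expected(frozen_skill, entrant_skill)
        frozen_elo, entrant_elo = update(frozen_elo, entrant_elo,
                                         1.0 if win else 0.0)

print(round(frozen_elo))  # below 1200, though the model itself never changed
```

The frozen model loses more often than it wins against each stronger entrant, so every batch of games pushes its rating further down, which is exactly the relative-measurement behavior described above.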