As far as I understand, this is exactly how Elo scores work. If a more capable model shows up and starts beating all the other models, it literally takes Elo points from everyone else.
Depends on the test design: is an agent competing against another agent in a given match, or against a test? Plus, does the test's Elo fluctuate?
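To make the two cases concrete, here is a rough sketch of the standard Elo update in Python. The function names, the K-factor of 32, and the `freeze_b` flag are my own assumptions for illustration, not anything a particular leaderboard is known to use. With both ratings free to move, whatever the winner gains the loser gives up; with the opponent (say, a fixed test) frozen, nobody loses points.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0, freeze_b: bool = False) -> tuple[float, float]:
    """Return the new ratings (A, B) after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    freeze_b is a hypothetical switch for a fixed-rating opponent
    such as a test whose rating is not allowed to fluctuate.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    new_a = r_a + delta
    # Agent vs. agent: zero-sum, so B loses exactly what A gains.
    # Agent vs. frozen test: B's rating stays where it was.
    new_b = r_b if freeze_b else r_b - delta
    return new_a, new_b


if __name__ == "__main__":
    # Agent vs. agent, equal ratings, A wins.
    print(elo_update(1500.0, 1500.0, 1.0))                 # (1516.0, 1484.0)
    # Agent vs. a test with a frozen rating, A wins.
    print(elo_update(1500.0, 1500.0, 1.0, freeze_b=True))  # (1516.0, 1500.0)
```

So "takes points from everyone else" only holds in the first setup. If the test's rating is pinned, a strong newcomer climbs without pulling anyone else down.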