Chatbot Arena is notoriously unreliable for several reasons. First, it's (at least in theory) based on feedback from ordinary people. Judging by how ordinary people currently vote, they clearly are not very good at identifying experts, or even remotely correct statements. Second, the leaderboards are gamed hard by the big companies. Even ARC-AGI has entered the actively gamed stage by now. Sure, the current-gen models are certainly better than the last, and if two models are far apart on a leaderboard there may be something fundamental behind it, but there is hardly any reason to use these kinds of comparison tables to tell the latest models apart.