The LLMs do play rated engines (Maia and Eubos), which provide the baselines. Gemini, for example, consistently beats the different Maia versions.
The rest is taken care of by Elo. That is, they then play each other as well, but it is not really possible for Gemini to have a higher Elo than Maia with such a small sample size (and such weak other LLMs).
Elo doesn't let you inflate your score by playing low-ranked opponents if there are known baselines (rated engines), because the rated engines will promptly crush your Elo.
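To make that concrete, here is a rough sketch of the standard Elo update. The ratings and the K-factor of 32 are made-up numbers for illustration, and the benchmark's actual rating computation may well differ; this only shows the general mechanism.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after one game (score_a: 1 win, 0.5 draw, 0 loss).

    k=32 is an assumed K-factor, not a value taken from the benchmark.
    """
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


# Hypothetical ratings, for illustration only.
inflated = 1900.0  # a rating that overstates the player's real strength
weak_llm = 1000.0  # a much weaker opponent
anchor = 1800.0    # a rated engine used as a baseline

# Beating a far weaker opponent was already expected, so it earns almost nothing.
gain_vs_weak = update(inflated, weak_llm, 1.0) - inflated

# Losing to an anchor you were expected to beat costs a lot.
loss_vs_anchor = update(inflated, anchor, 0.0) - inflated

print(f"win vs weak LLM: {gain_vs_weak:+.2f}")
print(f"loss vs anchor:  {loss_vs_anchor:+.2f}")
```

With these numbers the win over the weak opponent is worth well under one point, while a single loss to the anchor costs roughly twenty, so farming weak opponents cannot hold up an inflated rating for long.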
You could add humans into the mix; the benchmark just gets expensive.
I did indeed miss something. I learned after posting (but before my EDIT) that there are anchor engines that the LLMs play against.
However, these benchmarks still have flaws. First, the "two illegal moves = forfeit" rule is an odd one, which the benchmark's author (in this case Claude Code) added[1] for mysterious reasons. In competitive play, if you play an illegal move you forfeit the game.
Second (and this is a minor one), Maia 1900 is currently rated 1774 on lichess[2] but 1816 on the leaderboard. To the author’s credit, they do admit this in their methodology section.
Third, and this is a curiosity, gemini-3-pro-preview seems to have played the same game twice against Maia 1900[3][4], and in both cases Maia 1900 blundered mate in one (quite suspiciously, might I add) from a winning position with Qa3??. Another curiosity about this game: Gemini consistently played one of the top two moves on lichess. Until 16. ...O-O! (which has never been played on lichess), Gemini had played the most popular lichess move 14 times and the second most popular twice. That said, I’m not going to rule out that the game being listed twice stems from an innocent data entry error.
And finally, apart from Gemini (and Survival bot, for some reason?), LLMs seem unable to pass Maia 1100 (rated 1635 on lichess). The only anchor bot below that is random bot, and predictably the LLMs cluster on both sides of it, meaning they play about as well as random (apart from the illegal moves). This smells like benchmaxxing from Gemini. I would guess that the entire lichess repertoire features prominently in Gemini’s training data, and that the model has memorized it really well and is able to play extremely well if it only has to find 5-6 novel moves (especially when its opponent blunders checkmate in 1).
1: https://github.com/lightnesscaster/Chess-LLM-Benchmark/commi...
2: https://lichess.org/@/maia9
3: https://chessbenchllm.onrender.com/game/6574c5d6-c85a-4cb3-b...
4: https://chessbenchllm.onrender.com/game/4af82d60-8ef4-47d8-8...