
runarberg yesterday at 5:32 PM

I did indeed miss something. I learned after posting (but before my EDIT) that there are anchor engines that the LLMs play against.

However these benchmarks still have flaws. The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.

Second (and this is a minor one), Maia 1900 is currently rated 1774 on lichess[2], but 1816 on the leaderboard. To the author’s credit, they do admit this in their methodology section.
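To put a number on why I call this minor, here is a rough sketch using the standard Elo expected-score formula. This is an assumption on my part: lichess actually uses Glicko-2, and I have not checked which rating model the leaderboard uses, so treat the result as a ballpark only. The function name is mine.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B.

    Assumption: plain Elo, used here only as an approximation.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Leaderboard rating (1816) versus current lichess rating (1774) for Maia 1900.
leaderboard_rating = 1816
lichess_rating = 1774

# How much the 42-point gap would shift an expected result under Elo.
shift = expected_score(leaderboard_rating, lichess_rating)
print(f"expected score across a 42-point gap: {shift:.3f}")
```

Under that assumption a 42-point gap moves the expected score from 0.50 to roughly 0.56, which is noticeable but small next to the gaps between the anchors.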

Third, and this is a curiosity, gemini-3-pro-preview seems to have played the same game twice against Maia 1900[3][4], and in both cases Maia 1900 blundered mate in one (quite suspiciously, might I add) with Qa3?? when in a winning position. Another curiosity about this game: Gemini consistently played one of the top 2 moves on lichess. Until 16. ...O-O! (which has never been played on lichess), Gemini had played the most popular lichess move 14 times and the second most popular move twice. That said, I’m not gonna rule out that this game being listed twice might stem from an innocent data entry error.

And finally, apart from Gemini (and Survival bot for some reason?), LLMs seem unable to pass Maia-1100 (rated 1635 on lichess). The only anchor bot below that is random bot. And predictably LLMs cluster on both sides of it, meaning they play about as well as random (apart from the illegal moves). This smells like benchmaxxing from Gemini. I would guess that the entire lichess repertoire features prominently in Gemini’s training data, that the model has memorized it really well, and that it is able to play extremely well if it only has to play 5-6 novel moves (especially when its opponent blunders checkmate in 1).

1: https://github.com/lightnesscaster/Chess-LLM-Benchmark/commi...

2: https://lichess.org/@/maia9

3: https://chessbenchllm.onrender.com/game/6574c5d6-c85a-4cb3-b...

4: https://chessbenchllm.onrender.com/game/4af82d60-8ef4-47d8-8...


Replies

dwohnitmok yesterday at 5:59 PM

> The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.

This is not true. This is clearly spelled out in FIDE rules and is upheld at tournaments. First illegal move is a warning and reset. Second illegal move is forfeit. See here https://rcc.fide.com/article7/
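As a minimal sketch of that rule (first illegal move is a warning and the move is taken back, second is a forfeit), something like the following would do. This is not the benchmark's actual code; the function name, the `is_legal` callback, and the toy move set are all made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def apply_illegal_move_rule(
    attempted: Sequence[str],
    is_legal: Callable[[str, List[str]], bool],
    max_illegal: int = 2,
) -> Tuple[List[str], bool]:
    """Return (accepted_moves, forfeited) for one player's attempted moves.

    An illegal attempt is rejected (the position is unchanged) and counted.
    Reaching max_illegal illegal attempts ends the game as a forfeit.
    """
    accepted: List[str] = []
    illegal_count = 0
    for move in attempted:
        if is_legal(move, accepted):
            accepted.append(move)
        else:
            illegal_count += 1  # first offence: warning, move is not played
            if illegal_count >= max_illegal:
                return accepted, True  # second offence: forfeit
    return accepted, False


# Toy legality check (hypothetical): only these strings count as legal.
TOY_LEGAL = {"e4", "e5", "Nf3"}


def toy_is_legal(move: str, history: List[str]) -> bool:
    return move in TOY_LEGAL


one_offence = apply_illegal_move_rule(["e4", "Ke9", "e5"], toy_is_legal)
two_offences = apply_illegal_move_rule(["e4", "Ke9", "e5", "Qz1", "Nf3"], toy_is_legal)
```

With one illegal attempt the game continues, and with two it ends before the final move is considered. A real harness would replace the toy check with a proper move generator.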

I doubt GDM is benchmarkmaxxing on chess. Gemini is a weird model that acts very differently from other LLMs so it doesn't surprise me that it has a different capability profile.
