runarberg • yesterday at 6:57 PM • 1 reply

Replying in a split thread to clearly separate where I was wrong.

If Gemini is so good at chess because of a non-LLM feature of the model, then it is kind of disingenuous to rate it as an LLM and claim that LLMs are approaching 2000 Elo. But the fact that it still sometimes plays illegal moves, is biased towards popular moves, etc. makes me think that chess is still handled by an LLM, and makes me suspect benchmaxxing.

But even if there is no foul play, and Gemini is truly a capable chess player with nothing but an LLM underneath it, then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot. My fourth point above was my strongest point. There are only 4 anchor engines: one beats all LLMs, the second beats all except Gemini, the third beats all LLMs except Gemini and Survival bot (what is Survival bot even doing there?), and the fourth is random bot.

Replies

dwohnitmok • today at 6:21 AM

Gemini is an LLM. Its chess play does not rely on a non-LLM module of some sort. I'm just saying that, as an LLM, Gemini has a peculiar profile compared to other LLMs (likely an artifact of its post-training process). In particular, Gemini is very capable but also quite misaligned (it will more often actively sabotage users).

> then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot

That's overly reductive. That would be true if we didn't see improvement over time from the other LLMs, but we clearly do. In particular, even if Gemini is benchmarkmaxxing, this means that LLMs from other labs will eventually get there as well. Benchmarkmaxxing can be thought of as "premature" reaching of benchmarks. But I can't think of a single benchmark that was benchmarkmaxxed that wasn't eventually saturated by every single LLM provider: being able to benchmarkmaxx serves as an existence proof that there is an LLM capable of it, and as more training gets done on the other LLMs, they get there too.
