logoalt Hacker News

dwohnitmoktoday at 6:21 AM0 repliesview on HN

Gemini is an LLM. It playing chess is not relying on a non-LLM module of some sort. I'm just saying that as an LLM, Gemini has a peculiar profile compared to other LLMs (likely an artifact of its post-training process). In particular Gemini is very capable, but also quite misaligned (it will more often actively sabotage users).

> then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot

That's overly reductive. That would be true if we didn't see improvement over time from the other LLMs but we clearly do. In particular, even if Gemini is benchmarkmaxxing, this means that LLMs from other labs will eventually get there as well. Benchmarkmaxxing can be thought of as "premature" reaching of benchmarks. But I can't think of a single benchmark that was benchmarkmaxxed that wasn't eventually saturated by every single LLM provider (because being able to benchmarkmaxx serves as an existence proof that there is an LLM capable of it and as more training gets done on the LLMs the other ones get there).