This may be objectively scored, but it is not an indication of anyone's coding capabilities. This test measures which model almost accidentally came up with the best strategy (against other bots). That is not representative of coding. You would need to test 100 or more such puzzles, spread widely across the puzzle spectrum, to get an idea of which model is best at finding strategies involving an English dictionary.
> You would need to test 100 or more of such puzzles, widely spread across the puzzle spectrum
Would you? I am not very knowledgeable about LLMs, but my understanding was that each query is essentially a stateless inference with the previous input/output as context. In that case, wouldn't a single puzzle, yielding hundreds of queries, essentially amount to hundreds of path-dependent but individual tests?
I don't think that is entirely fair. I don't see them stating anywhere that they are measuring coding capabilities: "Using complex games to probe real intelligence."
And this seems very much in line with the methodology in ARC-AGI-3.
The results here, in the OP article, and on https://www.designarena.ai all tell a similar story: Kimi K2.6 is up in the SOTA mix.