Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...
The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved".
>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview
> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
They will never run it on the private set, because that would mean the set being leaked to Google.
Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.