Yes, but benchmarks like this are often flawed because leading model labs frequently engage in 'benchmarkmaxxing'; i.e., improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though this does seem like a step-function increase in intelligence for the Gemini line of models).
Could it also be that the models are just a lot better than a year ago?
Would be cool to have a benchmark built from genuinely unsolved math and science questions, although I suspect models are still quite a long way from that level.
Isn’t the point of ARC that you can’t train against it? Or does it somehow no longer achieve that goal?