Weren't we barely scraping 1-10% on this with state of the art models a year ago and it was considered that this is the final boss, ie solve this and its almost AGI-like?
I ask because I cannot distinguish all the benchmarks by heart.
Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing' - ie improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step function increase in intelligence for the Gemini line of models)
Here's a good thread over 1+ month, as each model comes out
https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...
tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark
François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.
His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.