Hacker News

fishpham, yesterday at 6:02 PM

Yes, but benchmarks like this are often flawed because leading model labs frequently engage in 'benchmarkmaxxing', i.e., improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though this does seem to be a step-function increase in intelligence for the Gemini line of models).



layer8, yesterday at 6:39 PM

Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?

jstummbillig, yesterday at 6:38 PM

Could it also be that the models are just a lot better than a year ago?

olalonde, yesterday at 6:56 PM

Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.
