Yes, but benchmarks like this are often flawed because leading model labs frequently engage in 'benchmarkmaxxing'; i.e., improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though this does seem like a step-function increase in intelligence for the Gemini line of models).
Could it also be that the models are just a lot better than a year ago?
Would be cool to have a benchmark built from genuinely unsolved math and science questions, although I suspect models are still quite a long way from that level.
Isn’t the point of ARC that you can’t train against it? Or does it somehow no longer achieve that goal?