Hacker News

famouswaffles, today at 2:04 PM

>The point of ARC-AGI is to test the intelligence of AI systems in novel, but simple, environments.

The point is whatever Francois wants it to be.

>Having a human give it more powerful tools in a harness defeats the purpose.

Why does it defeat the purpose? Restricting the tools available is an arbitrary constraint. The Duke harness is a few basic tools. What's the problem? In what universe would any AI agent worth its salt not have access to read, grep and bash? If his benchmark were as great, and the difference as wide, as he claimed, then it simply wouldn't matter whether those tools were available. Francois removed access to tools because his benchmark falls apart with them. Simple as.

>You should go back and read the original ARC-AGI paper to see what this is about.

>Are you upset about the benchmark because frontier LLM models do so poorly exhibiting the ability to generalize when the benchmarks are released?

I’m not upset about anything. I do not care about ARC, and I never have. I think it is a nothingburger of a benchmark: lots of grand claims about AGI, but very little predictive power or practical utility.

When models started climbing FrontierMath, that benchmark actually told us something useful: their mathematical capabilities were becoming materially stronger. And now state-of-the-art systems have helped with real research and even contributed to solving open problems. That is what a good benchmark is supposed to do.

ARC? It has zero utility on its own and manages to tell you nothing at the same time.

Unsaturated benchmarks matter because they help show where the state of the art actually is. The value is not “look, the score is low,” but whether the benchmark tells you something real and useful about capability. ARC has always struggled on that front, but ARC-AGI-3 has taken that to a new level of uselessness.