Those are supposed to be issues? After reading your list my impression of ARC-AGI has gone up rather than down. All of those things seem like the right way to go about this.
Yeah I'm quite surprised as to how all of those are supposed to be considered problems. They all make sense to me if we're trying to judge whether these tools are AGI, no?
“no harness at all” might be an issue, though, as these types of benchmarks are often gamed and then models perform great on them without actually being better models.
They are severe problems if your income is tied to LLM hype generation.
No, those aren't issues. But it's good to know the meaning of those numbers we get. For example, 25% is about the average human level (on this category of problems). 100% is either top human level or superhuman level or the information-theoretically optimal level.