ARC-AGI isn't perfect, but it helps demonstrate the gap. I'm sure all companies optimize their models for this benchmark given its dominance.
What about other benchmarks? Benchmarks whose contents are freely available have become useless for evaluating models, since the test items can end up in the training data.