That score is in the ARC technical paper [1]. It's the full benchmark score using this harness [2] (which is just open code with read, grep, and bash tools).
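For context, a tool set like that is tiny. A rough sketch of what "read, grep, bash" tools exposed to a model might look like (purely illustrative, not the actual code from [2]; every name here is hypothetical):

```python
import subprocess

# Hypothetical sketch of the three basic agent tools; not the harness from [2].
def read_file(path: str, max_bytes: int = 65536) -> str:
    """Return the (truncated) contents of a file for the model to inspect."""
    with open(path, "r", errors="replace") as f:
        return f.read(max_bytes)

def grep(pattern: str, path: str) -> str:
    """Run grep -rn over a path and return matching lines."""
    result = subprocess.run(["grep", "-rn", pattern, path],
                            capture_output=True, text=True)
    return result.stdout

def bash(command: str, timeout: int = 60) -> str:
    """Execute a shell command and return combined stdout/stderr."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr
```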
This is already a solved benchmark. That's why the scoring is so convoluted and a self-proclaimed agent benchmark won't allow basic agent tools. ARC has always been a bit of a nothingburger of a benchmark, but this takes the cake.
[1] https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
> For example, in a variant of environment TR87, Opus 4.6 scores 0.0% with no harness and 97.1% with the Duke harness (12), yet in environment BP35, Opus 4.6 scores 0.0% under both configurations
This is with a harness that was designed to tackle "a small set of public environments: ls20, ft09, and vc33" (of the ARC-AGI-3 challenge), yet it looks like it does not solve the full ARC-AGI-3 benchmark, only some of its environments.