logoalt Hacker News

vbarrielleyesterday at 4:47 PM1 replyview on HN

> For example, in a variant of environment TR87, Opus 4.6 scores 0.0% with no harness and 97.1% with the Duke harness (12), yet in environment BP35, Opus 4.6 scores 0.0% under both configuration

This is with a harness that has been designed to tackle "a small set of public environments: ls20, ft09, and vc33" (of the arc-agi-3 challenge), yet it looks like it does not solve the full arc-agi-3 benchmark, just some of them.


Replies

famouswafflesyesterday at 4:51 PM

The harness was designed with the preview, but no it was still tested on the full public set in that environment. You can run the benchmark in different 'environments' though it's unclear what the difference between them is.

>We then tested the harnesses on the full public set (which researchers did not have access to at the time)

show 1 reply