> For example, in a variant of environment TR87, Opus 4.6 scores 0.0% with no harness and 97.1% ...

vbarrielle • yesterday at 4:47 PM • 1 reply

> For example, in a variant of environment TR87, Opus 4.6 scores 0.0% with no harness and 97.1% with the Duke harness (12), yet in environment BP35, Opus 4.6 scores 0.0% under both configurations.

This is with a harness designed to tackle "a small set of public environments: ls20, ft09, and vc33" (of the ARC-AGI-3 challenge), yet it looks like it does not solve the full ARC-AGI-3 benchmark, only some of its environments.

Replies

famouswaffles • yesterday at 4:51 PM

The harness was designed with the preview, but no, it was still tested on the full public set in that environment. You can run the benchmark in different 'environments', though it's unclear what the difference between them is.

> We then tested the harnesses on the full public set (which researchers did not have access to at the time)
