Did you look at the ARC AGI 2? Codex might be overfit for terminal bench
ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.
ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.