I think my problem is that I’m not sure whether your evals are testing language abilities or reasoning abilities.
The results seem to be presented as if they’re testing language abilities, but the problems themselves look like reasoning problems.