Looking closely at the graphs, the Anthropic models are clearly all higher than the OpenAI models.
Whether the difference is meaningful can't be determined from the graphs (and there's no reasoned basis for picking one graph over the ensemble, given that these are all arbitrary).