One-shot performance is often a good indicator of the most difficult problems a model will be able to understand. We run an evaluation that tests both agentic and one-shot performance, and we find that Chinese models are almost universally very good at using tools and a harness to iterate towards a better solution, whereas their initial responses rank relatively low.
Compare that to Gemini models, which show impressive fluid intelligence on the first response but fail to call tools or explore correctly, which limits their usefulness for agentic coding.
Neither will be great for coding in a computational chemistry repo, each for a different reason, but the model with strong one-shot performance will be less likely to make subtle errors indicative of poor understanding, so we weight both capabilities into each model's final score.
The latest Anthropic and OpenAI models excel in both domains.
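As a rough illustration of weighting both capabilities into one score, here is a minimal sketch. It assumes a simple weighted average of two scores normalized to 0..1. The names `ModelScores`, `combined_score`, and `one_shot_weight` are hypothetical, as are the formula and the numbers; none of them are taken from the actual evaluation.

```python
from dataclasses import dataclass


@dataclass
class ModelScores:
    """Hypothetical per-model scores, each normalized to the range 0..1."""
    one_shot: float  # quality of the first response, without tools
    agentic: float   # quality after iterating with tools in a harness


def combined_score(scores: ModelScores, one_shot_weight: float = 0.5) -> float:
    """Blend both capabilities with a weighted average (an assumed formula)."""
    if not 0.0 <= one_shot_weight <= 1.0:
        raise ValueError("one_shot_weight must be between 0 and 1")
    agentic_weight = 1.0 - one_shot_weight
    return one_shot_weight * scores.one_shot + agentic_weight * scores.agentic


# Made-up numbers for illustration only, not measured results.
strong_iterator = ModelScores(one_shot=0.4, agentic=0.8)
strong_first_try = ModelScores(one_shot=0.8, agentic=0.4)
strong_at_both = ModelScores(one_shot=0.8, agentic=0.8)

for name, scores in [
    ("strong_iterator", strong_iterator),
    ("strong_first_try", strong_first_try),
    ("strong_at_both", strong_at_both),
]:
    print(f"{name}: {combined_score(scores):.2f}")
```

With equal weights, the two lopsided profiles tie and only the model that is strong at both ranks higher. Raising `one_shot_weight` favors the model with the better first response, which is one way to express the preference for fewer subtle errors.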
Data at https://gertlabs.com/rankings