My feeling is that for agentic tasks it is not only language design but also LSPs, error messages, and static analysis capabilities that dominate the benchmarks. IMHO it would be interesting to look into better subsets of Python and style/rewrite techniques, as well as alternative linters and their effects on performance.
But then why does JS score 50% better? (Almost identical to TypeScript.)
Actually, JS can get a surprising amount of "intellisense" as well. Not sure if that was used here though.
A strict compiler is basically a free feedback loop for the LLM.
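To make the "free feedback loop" idea concrete, here is a rough sketch of what such a loop could look like. It is an illustration under loud assumptions, not how any benchmark harness actually works: `check`, `repair_loop`, and `toy_fix` are hypothetical names, `toy_fix` stands in for an LLM call, and Python's built-in `compile()` only catches syntax errors, so it is a much weaker checker than a strict compiler or type checker would be.

```python
from typing import Callable, Optional, Tuple


def check(source: str) -> Optional[str]:
    """Return a diagnostic string if the source fails to compile, else None."""
    try:
        # compile() only parses; it does not execute the candidate code.
        compile(source, "<candidate>", "exec")
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


def repair_loop(
    source: str,
    propose_fix: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Feed checker diagnostics back to propose_fix until the source is clean.

    Returns the final source and the number of fix attempts that were made.
    """
    for round_no in range(max_rounds):
        diagnostic = check(source)
        if diagnostic is None:
            return source, round_no
        # Hypothetical: a real setup would send source + diagnostic to the model.
        source = propose_fix(source, diagnostic)
    return source, max_rounds


def toy_fix(source: str, diagnostic: str) -> str:
    """Stand-in for the model: only knows how to add one missing colon."""
    return source.replace("def f(x)\n", "def f(x):\n")


broken = "def f(x)\n    return x + 1\n"
fixed, rounds = repair_loop(broken, toy_fix)
```

The point is that the diagnostic costs nothing extra to produce: the stricter the checker, the more mistakes get caught in this loop instead of surfacing later at runtime.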