Just tested it on my homemade WordPress+GravityForms benchmark, and it's one of the worst models on the leaderboard performance-wise and the worst value-wise: https://github.com/guilamu/llms-wordpress-plugin-benchmark
I know it's only a single benchmark, but I don't understand how it can be so bad...
Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.
You even traveled in time to deliver us this benchmark.
I really like this benchmarking. Have you evaluated the judge benchmark somehow? I'd love to set up my own similar benchmark.
Seems like a benchmark for how good a model is at vibe coding.
Your prompt is extremely slim, yet you score it on a bunch of features.
gemma4-e4b is 50% better than gemma4-26b in your benchmark, so something's wrong.