Just tested it on my homemade Wordpress+GravityForms benchmark and it's one of the worst models ...

guilamu • yesterday at 8:05 PM • 5 replies

Just tested it on my homemade Wordpress+GravityForms benchmark and it's one of the worst models on the leaderboard performance-wise and the worst value-wise: https://github.com/guilamu/llms-wordpress-plugin-benchmark

I know it's only a single benchmark, but I don't understand how it can be so bad...

Replies

goldenarm • yesterday at 8:27 PM

gemma4-e4b is 50% better than gemma4-26b in your benchmark, so something's wrong.

ac29 • yesterday at 8:14 PM

Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.

mosselman • yesterday at 8:18 PM

You even traveled in time to deliver us this benchmark.

I really like this benchmark. Have you evaluated the judge somehow? I'd love to set up my own similar benchmark.

DrProtic • yesterday at 8:26 PM

Seems like a benchmark for how good a model is at vibe coding.

Your prompt is extremely slim, yet you score it on a bunch of features.

gizmodo59 • today at 12:27 AM

[dead]
