Hacker News

guilamu yesterday at 8:05 PM (5 replies)

Just tested it on my homemade WordPress+GravityForms benchmark, and it's one of the worst models on the leaderboard performance-wise and the worst value-wise: https://github.com/guilamu/llms-wordpress-plugin-benchmark

I know it's only a single benchmark, but I don't understand how it can be so bad...


Replies

goldenarm yesterday at 8:27 PM

gemma4-e4b is 50% better than gemma4-26b in your benchmark; something's wrong.

ac29 yesterday at 8:14 PM

Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.

mosselman yesterday at 8:18 PM

You even traveled in time to deliver us this benchmark.

I really like this benchmarking. Have you evaluated the judge benchmark somehow? I'd love to set up my own similar benchmark.

DrProtic yesterday at 8:26 PM

Seems like a benchmark for how good a model is at vibe coding.

Your prompt is extremely slim yet you score it on a bunch of features.

gizmodo59 today at 12:27 AM

[dead]