I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness.
The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.
The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 on cost-effectiveness and #5 on performance.
Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.
Rankings are based on relative orderings only (not raw scores), fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena — absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
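To make the idea concrete, here is a minimal sketch of fitting a plain Plackett-Luce model on per-task orderings with percentile-bootstrap CIs. It is not the OpenClaw code and skips the "grouped" (tie-handling) variant; the toy rankings and hyperparameters are placeholders.

```python
import numpy as np

def fit_plackett_luce(rankings, n_models, iters=500, lr=0.2):
    """rankings: list of orderings, each a list of model indices, best first."""
    theta = np.zeros(n_models)
    for _ in range(iters):
        grad = np.zeros(n_models)
        for order in rankings:
            idx = np.array(order)
            for m in range(len(idx) - 1):
                tail = idx[m:]                 # models still "in the race" at this stage
                p = np.exp(theta[tail])
                p /= p.sum()
                grad[idx[m]] += 1.0            # observed winner of this stage
                grad[tail] -= p                # expected wins under the model
        theta += lr * grad / max(len(rankings), 1)
        theta -= theta.mean()                  # center to fix identifiability
    return theta

def bootstrap_ci(rankings, n_models, n_boot=200, alpha=0.05):
    """Percentile bootstrap: resample whole task rankings with replacement, refit."""
    rng = np.random.default_rng(0)
    samples = []
    for _ in range(n_boot):
        resampled = [rankings[i] for i in rng.integers(0, len(rankings), len(rankings))]
        samples.append(fit_plackett_luce(resampled, n_models))
    samples = np.array(samples)
    return (np.percentile(samples, 100 * alpha / 2, axis=0),
            np.percentile(samples, 100 * (1 - alpha / 2), axis=0))

# Toy example: 3 models, each inner list is one task's ordering (best first).
rankings = [[0, 1, 2], [0, 2, 1], [1, 0, 2], [0, 1, 2]]
theta = fit_plackett_luce(rankings, n_models=3)
lo, hi = bootstrap_ci(rankings, n_models=3)
print(theta, lo, hi)
```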
I built this as part of OpenClaw Arena — submit any task, pick 2-5 models, a judge agent evaluates in a fresh VM. Public benchmarks are free.
Could you add a column for time or number of tokens? Some models take forever because of their excessive reasoning chains.
Please don’t use AI to write comments; it cuts against HN guidelines.
>Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance
This matches my subjective experience, and on cost it has been objectively true as well.
Cheapest just isn't a very useful metric on its own. Can I suggest a Pareto-frontier-style representation? Cost per request vs. Elo would be useful, and you have all the data.
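Something like the sketch below, assuming each model has a cost-per-request and a rating; the model names and numbers are made up, not the actual leaderboard data. A model sits on the frontier only if nothing else is both cheaper and higher-rated.

```python
def pareto_frontier(points):
    """points: list of (name, cost_per_request, score). Keep a point only if
    no other point is both cheaper and higher-scoring."""
    frontier = []
    # Sort cheapest first; among equal costs, strongest first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if not frontier or score > frontier[-1][2]:
            frontier.append((name, cost, score))
    return frontier

models = [
    ("model-a", 0.40, 1250),   # expensive but strongest
    ("model-b", 0.05, 1180),
    ("model-c", 0.08, 1150),   # dominated: costlier and weaker than model-b
    ("model-d", 0.02, 1100),
]
for name, cost, score in pareto_frontier(models):
    print(f"{name}: ${cost:.2f}/request, rating {score}")
```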