
skysniper today at 4:17 PM

I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness.

The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.

The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 on cost-effectiveness and #5 on performance.

Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.

Rankings use relative ordering only (not raw scores), fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena — absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
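
For the curious, the core fitting step looks roughly like the sketch below. This is a plain (ungrouped) Plackett-Luce fit with bootstrap CIs, not the production code: the model names, the rankings, and the fit/neg_log_likelihood helpers are all invented for illustration.

    # Minimal Plackett-Luce sketch: fit strengths from observed rankings,
    # then bootstrap whole rankings to get confidence intervals.
    import numpy as np
    from scipy.optimize import minimize

    models = ["model-a", "model-b", "model-c"]               # hypothetical
    rankings = [[0, 1, 2], [0, 2, 1], [1, 0, 2], [0, 1, 2]]  # best first

    def neg_log_likelihood(theta, rankings):
        # P(ranking) = product over positions of
        #   exp(theta[winner]) / sum(exp(theta[j]) for j not yet placed)
        nll = 0.0
        for r in rankings:
            for i in range(len(r) - 1):
                nll -= theta[r[i]] - np.logaddexp.reduce(theta[r[i:]])
        return nll

    def fit(rankings, n_models):
        res = minimize(neg_log_likelihood, np.zeros(n_models), args=(rankings,))
        return res.x - res.x.mean()  # identifiable only up to a shift

    point = fit(rankings, len(models))

    # Bootstrap: resample whole rankings with replacement, refit each replicate.
    rng = np.random.default_rng(0)
    resample = lambda: [rankings[j] for j in rng.integers(len(rankings), size=len(rankings))]
    boots = np.array([fit(resample(), len(models)) for _ in range(200)])
    lo, hi = np.percentile(boots, [2.5, 97.5], axis=0)
    for name, p, l, h in zip(models, point, lo, hi):
        print(f"{name}: {p:+.2f} [{l:+.2f}, {h:+.2f}]")

The grouped variant and the judge pipeline are more involved; see the methodology link above.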

I built this as part of OpenClaw Arena — submit any task, pick 2-5 models, a judge agent evaluates in a fresh VM. Public benchmarks are free.


Replies

vessenes today at 7:02 PM

Cheapest just isn't a very useful metric. Can I suggest a Pareto-curve type representation? Cost per request vs. Elo would be useful, and you have all the data.
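
For concreteness, the frontier is just the set of models that no other model beats on both axes at once. A quick sketch with invented numbers:

    # Pareto frontier over (cost per request, Elo): drop a model if some
    # other model is at least as cheap AND at least as strong, and strictly
    # better on one axis. All numbers here are made up.
    points = [
        ("opus", 0.80, 1320),   # (name, cost_usd_per_request, elo)
        ("flash", 0.02, 1210),
        ("mini", 0.05, 1180),   # dominated by "flash"
    ]

    frontier = [
        (name, cost, elo)
        for name, cost, elo in points
        if not any(c <= cost and e >= elo and (c < cost or e > elo)
                   for _, c, e in points)
    ]
    print(frontier)  # keeps "opus" and "flash"; "mini" is dominated

Plotting cost on a log x-axis against Elo would make the frontier easy to read, since per-request costs span orders of magnitude.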

johndough today at 6:01 PM

Could you add a column for time or number of tokens? Some models take forever because of their excessive reasoning chains.
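
Even a rough per-run record would do. Something like this hypothetical wrapper, where call_model stands in for whatever client the harness actually uses:

    import time

    def timed_run(call_model, prompt):
        # call_model is a stand-in; assumed to return (text, n_tokens)
        start = time.perf_counter()
        text, n_tokens = call_model(prompt)
        return {"latency_s": time.perf_counter() - start,
                "tokens": n_tokens,
                "text": text}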

refulgentis today at 4:31 PM

Please don’t use AI to write comments; it cuts against HN guidelines.

citizenpaul today at 6:14 PM

>Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance

This has also been my subjective experience, and it has been borne out objectively in terms of cost.