Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to ru...

mrngld • today at 11:42 AM • 4 replies

Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium, GLM5.1 xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd be a big gap to bridge.

https://artificialanalysis.ai/agents/coding-agents?coding-ag...

I thought I was "holding it wrong" until DeepSWE came along -- personally, it seems to match my own experience pretty well. Really makes me wonder how legitimate some of the internet noise about open models is. There are surely some use cases for them, and not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier, everyone needs to be honest about the fact that we're only talking about Opus, Fable, and GPT5.5.

Replies

undecidabot • today at 2:02 PM

It got 46.2 on DeepSWE in Z.ai's own run[1]. That would put it between Opus 4.7 xhigh and Opus 4.8 medium.

[1] https://z.ai/blog/glm-5.2

lukewarm707 • today at 1:29 PM

with open models you can get a subscription with privacy, at the same cost as codex.

openai, google and anthropic subscriptions are not available with privacy.

looking at the link there, it's interesting that going from cursor cli to codex cli takes gpt 5.5 from 7th to 3rd. but they didn't run open models in codex.

so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.


ttul • today at 1:27 PM

DeepSWE “feels” like the right benchmark in comparison to Artificial Analysis indices and other coding benchmarks. And by their metrics, GPT-5.5 is still king in token efficiency, speed, and overall intelligence per dollar.

https://deepswe.datacurve.ai/

Fable 5 is cool and all, but we have not yet seen GPT-5.6.

cmrdporcupine • today at 11:54 AM

I gave GLM 5.2 a spin on openrouter yesterday and it was mostly fine but it racked up $5 in token use in 30 minutes of (relatively slow) work.

It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

Having better luck with MiniMax M3, from a cost/benefit ratio.

