Tiberium • today at 10:29 AM • 11 replies

It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.

I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.

Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.

Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
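
A rough sketch of that conversion, using the average output-token figures quoted above. The per-million-token price passed in below is a made-up placeholder, not a real list price, and the helper names `request_cost` and `break_even_price` are hypothetical; plug in current pricing before drawing any conclusions.

```python
# Average output tokens per request, as quoted above from Artificial Analysis.
AVG_OUTPUT_TOKENS = {
    "GPT 5.5 xhigh": 16_000,
    "GPT 5.5 high": 10_000,
    "Fable 5": 33_000,
    "Opus 4.8": 41_000,
    "GLM 5.2": 42_000,
}


def request_cost(output_tokens: int, usd_per_million: float) -> float:
    """Output-token cost of one request at a given price per million tokens."""
    return output_tokens / 1_000_000 * usd_per_million


def break_even_price(tokens_a: int, price_a: float, tokens_b: int) -> float:
    """Price per million tokens at which model B costs the same per request as model A."""
    return tokens_a * price_a / tokens_b


if __name__ == "__main__":
    # ASSUMPTION: 10.0 USD per million output tokens is a placeholder, not a real price.
    placeholder_price = 10.0
    gpt = AVG_OUTPUT_TOKENS["GPT 5.5 xhigh"]
    glm = AVG_OUTPUT_TOKENS["GLM 5.2"]
    print(f"token ratio GLM/GPT: {glm / gpt:.3f}")
    print(f"GLM break-even price: {break_even_price(gpt, placeholder_price, glm):.2f} USD/M")
```

Whatever the actual prices are, the token counts alone imply that GLM 5.2 has to be more than about 2.6x cheaper per output token than GPT 5.5 xhigh (42k vs. 16k tokens) to come out ahead on per-request output cost. Input tokens and caching are ignored here.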

Replies

benjiro29 • today at 11:27 AM

GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output.

If you want reasonable token usage, you need to run GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks), and it cuts token usage by 2 to 2.5x. GLM 5.2 Max is really something you only need for complex tasks.

In essence, GLM 5.2 is Opus 4.8's little brother, at a way, WAY cheaper price.

There has been really no training on Opus models going on, really, none, I tell you! /sarcasm

alexjplant • today at 5:37 PM

> It seems to really be a nice step-up and is getting quite close to the frontier.

IMHO it's already surpassed them. I vastly prefer my personal GLM and OpenCode setup to the Claude Code and Opus one that I have to use at work. The former makes way fewer StackOverflow brogrammer-tier mistakes and is considerably better at following instructions. The harness UX is also vastly superior as it doesn't ignore, randomly change, or incorrectly report settings.

Maybe it's the harness and I'd have even greater success with OpenCode and Anthropic, but I think it's safe to say that Anthropic's moat is evaporating.

vorticalbox • today at 11:00 AM

This is a problem I find with Opus: it will spend so long thinking, then going “but wait, what if”

To the point where I stop it and simply tell it to “start writing code, you can work it out as you go along”

Seems writer's block also affects LLMs

h14h • today at 12:57 PM

Hopefully the recent work Moonshot did with Kimi K2.7 Code trickles into the other open-model labs.

Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.

bertili • today at 10:37 AM

This is GLM 5.2 Max. GLM 5.2 High uses less than half[1] the tokens.

[1] https://z.ai/blog/glm-5.2

robmccoll • today at 12:40 PM

That's interesting. I gave nearly the same task to Gemma4 31b as a test yesterday: write a symbolic math engine in TypeScript that can perform evaluation and simple expression reductions over +-/*(). It performed the task correctly with minimal reasoning, using far fewer reasoning tokens than output tokens.

rdsubhas • today at 1:26 PM

As per stats in other comments, it is frontier, not close to frontier.

cmrdporcupine • today at 11:58 AM

> Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.

GLM 5.2 ended up being far more expensive than I thought it would be when I tried it on OpenRouter. I ground through $5 USD worth of tokens quite quickly.

And this was high, not max.

esafak • today at 2:56 PM

I agree. I've noticed that it is quite smart but it has a tendency to doubt itself and overthink. I monitor its internal dialogue and prod it when it does this. They need to optimize early stopping of the chain of thought.
