
Aurornis yesterday at 5:21 PM

The benchmarks are impressive, but they compare against last-generation models (Opus 4.5 and GPT-5.2). The current competitor models are new, but there would easily have been enough time to re-run the benchmarks and update the press release by now.

Although it doesn't really matter much: all of the open-weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.


Replies

dongobread yesterday at 7:31 PM

What a strangely hostile statement about an open-weight model. Running something like 20 benchmark evaluations isn't trivial by itself, and even updating visuals and press statements can take a few days at a tech company. It's literally been 5 days since this "new generation" of models was released. GPT-5.3(-codex) can't even be called via API, so it's impossible to run some of the benchmarks against it.

I notice the people who endlessly praise closed-source models never actually USE open-weight models, or assume their drop-in prompting methods and workflows will just work for other model families. This is especially true for SWEs who used Claude Code first and now think every other model is horrible because they're ONLY used to prompting Claude. It's quite scary to see people develop this level of worship for a proprietary product that is openly distrusting of its users. I'm not saying this is true of the parent poster, but it's something I notice in general.

As someone who uses GLM-4.7 a good bit, I'd say it's easily at Sonnet 4.5 tier. I haven't tried GLM-5, but it would be surprising if it weren't at Opus 4.5 level given the massive parameter increase.

InsideOutSanta yesterday at 5:54 PM

> it's comparing to last generation models (Opus 4.5 and GPT-5.2).

If it's anywhere close to those models, I couldn't possibly be happier. Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.

dust42 yesterday at 6:08 PM

They are all just token generators without any intelligence. There is so little difference nowadays that I think in a blind test nobody would be able to tell the models apart, whether open source or closed source.

Today's meme was this question: "The car wash is only 50 meters from my house. I want to get my car washed, should I drive there or walk?"

Here is Claude's answer just right now:

"Walk! At only 50 meters (about 150 feet), it would take you less than a minute to walk there. Driving such a short distance would actually be less convenient - by the time you get in the car, start it, drive, and park, you could already be there on foot. Plus, you'd save fuel and your car would be right there waiting for you after the wash is done."

Here is ChatGPT, also right now:

"Since it’s only *50 meters* (about 160 feet), walking makes a lot more sense.

Here’s why:

* *It’s extremely close* — you’ll get there in under a minute.
* *Driving would actually be less efficient* — by the time you start the car, move it, and park, you’re already there.
* *Cold starts aren’t great for your engine* — very short drives don’t let the engine warm up properly.
* *It saves fuel and emissions*, even if it’s a tiny amount.
* You avoid the hassle of maneuvering such a short distance.

The only time driving might make sense is if:

* The weather is terrible (heavy rain, snow, extreme cold).
* You have mobility issues.
* The car wash requires you to stay in the vehicle and you don’t want to leave it there.

Otherwise, this is a perfect “walk it over” situation."

Please spare us all the talk about frontier and SOTA, and how only the closedAI models are any good while the others are all so bad and benchmaxxed. For most purposes a Toyota is just as good as a BMW or a Merc or whatever luxury brand tickles your fancy. What's more, the latest 80B Qwen Next is not far from Opus 4.6 but runs on my laptop.

throwup238 yesterday at 5:39 PM

> Although it doesn't really matter much: all of the open-weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

Agreed. I think the problem is that while they can innovate on algorithms and training efficiency, the human part of RLHF just doesn't scale, and they can't afford the massive amount of custom data created and purchased by the frontier labs.

IIRC it was the application of RLHF that solved a lot of the broken syntax generated by LLMs, like unbalanced braces, and I still see lots of these little problems in every open-source model I try. I don't think I've seen broken syntax from the frontier models, Codex or Claude, in over a year.

miki123211 yesterday at 8:27 PM

Anthropic, OpenAI and Google have real user data that they can use to steer their models. Chinese labs have benchmarks. Once you realize this, it's obvious why open models top the benchmarks but underperform in practice.

You can have self-hosted models. You can have models that improve based on your needs. You can't have both.

ionelaipatioaei yesterday at 6:03 PM

I think the only advantage the closed models have is the tooling around them (Claude Code and Codex). At this point, if forced, I could totally live with open models only.

cmrdporcupine yesterday at 5:33 PM

I tried GLM-5 via the API earlier this morning and was impressed.

Particularly for tool use.
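
For anyone who wants to poke at it the same way, here's a minimal sketch of a tool-use call using the OpenAI Python SDK pointed at an OpenAI-compatible endpoint. The base URL and model id below are placeholders, not the provider's real values - check their docs:

    # Minimal tool-use sketch against an OpenAI-compatible endpoint.
    # base_url and the model id are placeholders - substitute real values.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.example.com/v1",  # placeholder endpoint
        api_key="YOUR_API_KEY",
    )

    # Declare one callable tool so the model can exercise tool use.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="glm-5",  # placeholder model id
        messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
        tools=tools,
    )

    # If the model chose to call the tool, inspect the structured call.
    for call in response.choices[0].message.tool_calls or []:
        print(call.function.name, call.function.arguments)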

yieldcrv yesterday at 5:41 PM

come on guys, you were using Opus 4.5 literally a week ago and don't even like 4.6

something that is at parity with Opus 4.5 can ship everything you did in the last 8 weeks, ya know... when 4.5 came out

just remember to put all of this in perspective: most of the engineers and people here haven't even noticed any of this stuff, and those who have are too stubborn or policy-constrained to use it - and the open-source nature of the GLM series helps the policy-constrained organizations, since they can theoretically run it internally or on-prem.

show 1 reply