Notice how all the major AI companies (at least the ones that don't do open releases) stopped telling us how many parameters their models have. Parameter count was used as a measure of how great the proprietary models were until GPT-3, then it suddenly stopped.
And notice how inference prices have come down a lot, despite increasing pressure to make money. Opus 4.6 is $25/MTok, while Opus 4.1 was $75/MTok, the same as Opus 4 and Opus 3. OpenAI's o1 was $60/MTok and o1-pro $600/MTok; gpt-5.2 is $14/MTok and gpt-5.2-pro is $168/MTok.
Also note how GPT-4 was rumored to be in the 1.8T-parameter realm, while Chinese models in the 1T realm can now match or surpass it. And I doubt Chinese labs have a monopoly on those efficiency improvements.
I doubt frontier models have actually substantially grown in size in the last 1.5 years; they may even have far fewer parameters than the frontier models of old.
From what I've gathered, they've been mostly training-limited. Better training methods and cleaner training data allow smaller models to rival or outperform larger models trained with older methods on lower-quality data.
For example, the Qwen3 technical report[1] says that the Qwen3 models are architecturally very similar to Qwen2.5, with the main change being a tweak in the attention layers to stabilize training. And if you compare table 1 in the Qwen3 paper with table 1 in the Qwen2.5 technical report[2], the layer counts, attention configurations and so on are very similar. Yet Qwen3 was widely regarded as a significant upgrade over Qwen2.5.
However, for training, they doubled the pre-training token count and tripled the number of languages. It's been shown that training on more languages can actually help LLMs generalize better. They used Qwen2.5-VL and Qwen2.5 to generate additional training data by parsing a large number of PDFs and turning them into high-quality training tokens. They also improved their annotation so they could more effectively provide diverse training tokens to the model, improving training efficiency.
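To make that data-generation step concrete, here's a minimal sketch of what such a PDF-to-training-text pipeline could look like. The reports don't publish their actual pipeline, so the `vlm_transcribe` helper and the directory names below are hypothetical stand-ins:

```python
# Illustrative sketch only: the Qwen reports describe using Qwen2.5-VL to turn
# PDFs into training text, but don't publish the pipeline. vlm_transcribe() is
# a hypothetical wrapper around whatever vision-language model you have.
from pathlib import Path

from pdf2image import convert_from_path  # rasterizes PDF pages to PIL images


def vlm_transcribe(page_image) -> str:
    """Hypothetical placeholder: send one page image to a VLM and get back
    clean text/markdown suitable for training."""
    raise NotImplementedError("plug in your VLM client here")


def pdf_to_training_text(pdf_path: Path) -> str:
    pages = convert_from_path(str(pdf_path), dpi=200)
    # A real pipeline would also filter, deduplicate and quality-score the
    # transcriptions before they ever become training tokens.
    return "\n\n".join(vlm_transcribe(page) for page in pages)


if __name__ == "__main__":
    out_dir = Path("parsed_corpus")
    out_dir.mkdir(exist_ok=True)
    for pdf in Path("raw_pdfs").glob("*.pdf"):
        (out_dir / f"{pdf.stem}.txt").write_text(pdf_to_training_text(pdf))
```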
They continued this trend with Qwen3.5, where even more and better training data[3] made their Qwen3.5-397B-A17B model match the 1T-parameter Qwen3-Max-Base.
That said, there's also been a lot of work on model architecture[4], getting more speed and quality per parameter. In the case of the Qwen3-Next architecture, which 3.5 is based on, that means things like hybrid attention for faster long-context operation, plus sparse MoE and multi-token prediction for less compute per output token.
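To illustrate the sparse-MoE point, here's a toy top-k expert router in plain numpy; the sizes and routing below are invented for the example and are not Qwen3-Next's actual configuration:

```python
# Toy top-k mixture-of-experts layer, showing why sparse MoE spends far less
# compute per token than a dense layer with the same total parameter count.
# All dimensions are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256
n_experts, top_k = 8, 2          # only 2 of 8 experts run for any given token

W_in = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
W_out = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02
router = rng.standard_normal((d_model, n_experts)) * 0.02


def moe_forward(x):
    """x: (d_model,) hidden state for a single token."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                    # pick the top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros(d_model)
    for gate, e in zip(gates, top):
        out += gate * (np.maximum(x @ W_in[e], 0.0) @ W_out[e])  # ReLU FFN expert
    return out


y = moe_forward(rng.standard_normal(d_model))

total = W_in.size + W_out.size
active = top_k * (W_in[0].size + W_out[0].size)
print(f"expert params total: {total}, active per token: {active}")  # 262144 vs 65536
```

The model "has" all the expert parameters, but each token only pays for a fraction of them, which is presumably how a ~400B-parameter model with ~17B active (the A17B suffix) keeps per-token compute low.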
I used Qwen as an example here; from what I gather, they're just one instance of the general trend.
[1]: https://arxiv.org/abs/2505.09388
[2]: https://arxiv.org/abs/2412.15115
[3]: https://qwen.ai/blog?id=qwen3.5
[4]: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d...
Similar trend in open text-to-image models: Flux.1 was 12B, but now we have 6B models with much better quality. Qwen Image went from 20B to 7B while merging the edit line and improving quality. Now that the cost of spot H200s with 140GB has come down to A100 levels, you can finally try larger-scale finetuning/distillation/RL with these models. Very promising direction for open tools and models if the trend continues.
> Parameter count was used as a measure of how great the proprietary models were until GPT-3, then it suddenly stopped.
AFAICT that's mostly because what you're getting when you select a "model" from most of these cloud chat providers today isn't a specific concrete model, but rather a model family, where your inference request is routed to varying models within the family during the request. There's thus no single parameter count for "the model", since several entirely independent models can be involved in generating each response.
And to be clear, I'm not just talking about how selecting e.g. "ChatGPT 5.2" sometimes gets you a thinking model and sometimes doesn't, etc.
I'm rather saying that, even when specifically requesting the strongest / most intelligent "thinking" models, there are architectural reasons why the workload could be (and probably is) routed to several component "sub-models" that handle inference during different parts of the high-level response "lifecycle", with the inference framework detecting transition points in the response stream and "handing off" the context + response stream from one of these "sub-models" to another.
(Why? Well, imagine how much "smarter" a model could be if it had a lot more of its layers available for deliberation, because it didn't have to spend so many layers on full-fat NLP parsing of input or full-fat NLP generation of output. Split a model into a pipeline of three sub-models, where the first one is trained to "just understand" — i.e. deliberate by rephrasing whatever you say to it into simpler terms; the second one is trained to "just think" — i.e. assuming pre-"understood" input and doing deep scratch work in some arbitrary grammar to eventually write out a plan for a response; and the third one is trained to "just speak" — i.e. attend almost purely to the response plan and whatever context-tokens that plan attends to, to NLP-generate styled prose, in a given language, with whatever constraints the prompt required. Each of these sub-models can be far smaller and hotter in VRAM than a naive monolithic thinking model. And these sub-models can make a fixed assumption about which phase they're operating in, rather than having to spend precious layers just to make that determination, over and over again, on every single token generation step.)
And, presuming they're doing this, the cloud provider can then choose to route each response lifecycle phase to a different weight-complexity variant of that phase's sub-model. (Probably using a very cheap initial classifier model before each phase: context => scalar nextPhaseComplexityDemand.) Why? Because even if you choose the highest-intelligence model from the selector, and you give it a prompt that really depends on that intelligence for a response, your response will only require a complex understanding-phase sub-model if your input prose contains the high-NLP-complexity tokens that would confuse a lesser understanding-phase sub-model; and it will only require a complex responding-phase sub-model if the thinking-phase model's emitted response plan specifies complex NLP or instruction-following requirements that only a more complex responding-phase sub-model knows how to handle.
Which is great, because it means that now, even when using the "thinking" model, most people with most requests are only holding a reservation on a GPU that hosts the (probably still hundreds-of-billions-of-weights) high-complexity thinking-phase sub-model for the limited part of the response-generation lifecycle where the thinking phase is actually occurring. During the "understanding" and "responding" phases, that reservation can be released for someone else to use! And for the vast majority of requests, the "thinking" phase is the shortest phase. So users end up sitting around waiting for the "understanding" and "responding" phases to complete before triggering another inference request, which brings the per-user duty cycle of thinking-phase sub-model use way down.
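None of this is confirmed by any provider, so purely to make the speculation concrete: here's a minimal sketch of what such a phase pipeline with per-phase complexity routing could look like, where every name (the phases, the classifier, the "small"/"large" tiers) is hypothetical:

```python
# Speculative sketch of an "understand -> think -> respond" pipeline where a
# cheap classifier picks a complexity variant of each phase's sub-model.
# No provider has confirmed doing this; all names and interfaces are made up.
from dataclasses import dataclass
from typing import Callable, Dict, List

SubModel = Callable[[str], str]           # context in, new text out (stand-in)


@dataclass
class Phase:
    name: str                              # "understand" | "think" | "respond"
    variants: Dict[str, SubModel]          # e.g. "small" and "large" weight variants
    classifier: Callable[[str], float]     # cheap model: context -> demand score

    def run(self, context: str) -> str:
        demand = self.classifier(context)
        tier = "large" if demand > 0.5 else "small"
        # Only while this phase runs does the request hold a reservation on the
        # GPU(s) hosting this tier's weights; it is released at the handoff.
        return self.variants[tier](context)


def generate(prompt: str, phases: List[Phase]) -> str:
    context = prompt
    for phase in phases:
        # Hand the accumulated context + partial response stream from one
        # sub-model to the next at each detected transition point.
        context += "\n" + phase.run(context)
    return context


# Dummy usage with stub "models" just to show the control flow:
echo = lambda ctx: f"[{len(ctx)} chars processed]"
phases = [Phase(name, {"small": echo, "large": echo}, lambda c: 0.2)
          for name in ("understand", "think", "respond")]
print(generate("Explain MoE routing briefly.", phases))
```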
> I doubt frontier models have actually substantially grown in size in the last 1.5 years
... and you'd most likely be very correct in your doubt, given the evidence we have.
What has improved disproportionately more than the software or hardware side is the density[1] per parameter, indicating that there's a "Moore's Law"-esque relationship between the number of parameters, the density per parameter and the compute requirements. As long as more and more information/abilities can be squeezed into the same number of parameters, inference will become cheaper and cheaper, quicker and quicker.
I write "quicker and quicker", because next to improvements in density there will still be additional architectural-, software- and hardware-improvements. It's almost as if it's going exponential and we're heading for a so called Singularity.
Since it's far more efficient and "intelligent" to have many small models competing with and correcting each other in parallel for the best possible answer, there's simply no need for giant, inefficient, monolithic monsters.
They ain't gonna tell us that, though, because then we'd know that we don't need them anymore.
[1] for lack of a better term; I'm not aware of the proper one.
You're hitting on something really important that barely gets discussed. For instance, notice how Opus 4.5's speed essentially doubled, bringing it right in line with the speed of Sonnet 4.5? (Sonnet 4.6 got a speed bump too, though closer to 25%.)
It was the very first thing I noticed: it looks suspiciously like they just rebranded Sonnet as Opus and raised the price.
I don't know why more people aren't talking about this. Even on X, where the owner directly competes in this market, it's rarely brought up. I strongly suspect there is a sort of tacit collusion among competitors in this space: they all share a strong motivation to kill any deep discussion of token economics, even about each other, because transparency only arms the customers. By keeping the underlying mechanics nebulous, they can all justify higher prices. Just look at the subscription tiers: every single major player has settled on the exact same pricing model, a $20 floor and a $200 cap, no exceptions.