
derefr · yesterday at 8:24 PM

> Parameter count was used as a measure for how great the proprietary models were until GPT3, then it suddenly stopped.

AFAICT that's mostly because what you're getting when you select a "model" from most of these cloud chat model providers today isn't a specific concrete model, but rather a model family, with your inference request routed to varying models within that family over the course of the request. There's thus no single parameter count for "the model", since several entirely independent models can be involved in generating each response.

And to be clear, I'm not just talking about how selecting e.g. "ChatGPT 5.2" sometimes gets you a thinking model and sometimes doesn't, etc.

I'm rather saying that, even when you specifically request the strongest / most intelligent "thinking" models, there are architectural reasons the workload could be (and probably is) routed to several component "sub-models" that handle inference during different parts of the high-level response "lifecycle", with the inference framework detecting transition points in the response stream and "handing off" the context + response stream from one sub-model to the next.
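
To make that handoff idea concrete, here's a toy sketch of the lifecycle loop. Everything in it (the names, the phase-done marker, the stubbed "sub-models") is invented for illustration; I'm not claiming any real inference framework is structured exactly this way:

```python
from typing import Callable, List

# A "sub-model" is stubbed as: (context + response stream so far) -> the tokens
# it wants to emit for its phase. Purely hypothetical, not any real API.
SubModel = Callable[[List[str]], List[str]]

PHASE_DONE = "<|phase_done|>"  # transition marker the framework watches for

def run_lifecycle(prompt: str, phases: List[SubModel]) -> List[str]:
    """Run each phase's sub-model in turn, handing the accumulated context +
    partial response stream off to the next sub-model at each transition."""
    stream: List[str] = prompt.split()
    for sub_model in phases:
        for token in sub_model(stream):
            if token == PHASE_DONE:
                break  # transition point detected: hand off to the next phase
            stream.append(token)
    return stream
```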

(Why? Well, imagine how much "smarter" a model could be if it had a lot more of its layers available for deliberation, because it didn't have to spend so many layers on full-fat NLP parsing of input or full-fat NLP generation of output. Split a model into a pipeline of three sub-models, where the first one is trained to "just understand" — i.e. deliberate by rephrasing whatever you say to it into simpler terms; the second one is trained to "just think" — i.e. assuming pre-"understood" input and doing deep scratch work in some arbitrary grammar to eventually write out a plan for a response; and the third one is trained to "just speak" — i.e. attend almost purely to the response plan and whatever context-tokens that plan attends to, to NLP-generate styled prose, in a given language, with whatever constraints the prompt required. Each of these sub-models can be far smaller and hotter in VRAM than a naive monolithic thinking model. And these sub-models can make a fixed assumption about which phase they're operating in, rather than having to spend precious layers just to make that determination, over and over again, on every single token generation step.)
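
As a cartoon of what that three-way split could look like (these are string-munging stand-ins, not models; the function names and tags are all made up):

```python
from typing import List

def understand(stream: List[str]) -> List[str]:
    # "Just understand": rephrase the raw prompt into a simpler canonical form.
    return ["<intent>", *[t.lower().strip("?!.") for t in stream], "</intent>"]

def think(stream: List[str]) -> List[str]:
    # "Just think": assume pre-understood input; do scratch work, emit a plan.
    return ["<plan>", "answer", "the", "question", "in", "one", "sentence", "</plan>"]

def speak(stream: List[str]) -> List[str]:
    # "Just speak": attend mostly to the plan; render it as styled prose.
    plan = stream[stream.index("<plan>") + 1 : stream.index("</plan>")]
    return ["Short", "version:", *plan, "."]

def pipeline(prompt: str) -> str:
    stream = prompt.split()
    output: List[str] = []
    for phase in (understand, think, speak):
        output = phase(stream)   # each stage assumes its phase; no re-detection
        stream += output         # hand the grown context to the next stage
    return " ".join(output)      # the final "spoken" tokens

print(pipeline("What is the airspeed of an unladen swallow?"))
```

The point of the toy is just the shape: each stage gets to hard-code which phase it's in, instead of re-deriving that on every single token generation step.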

And, presuming they're doing this, the cloud provider can then choose to route each response-lifecycle phase to a different weight-complexity variant of that phase's sub-model. (Probably using a very cheap initial classifier model before each phase: context => scalar nextPhaseComplexityDemand.) Why? Because even if you pick the highest-intelligence model from the selector, and you give it a prompt that really depends on that intelligence for a response... your request only needs a complex understanding-phase sub-model if your input prose contains the kind of high-NLP-complexity tokens that would confuse a lesser understanding-phase sub-model; and it only needs a complex responding-phase sub-model if the thinking-phase sub-model's emitted response plan specifies complex NLP or prompt-instruction-following requirements that only a more complex responding-phase sub-model knows how to satisfy.
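
A hedged sketch of what that per-phase routing could look like (the classifier, the variant names, and the thresholds here are all invented for illustration):

```python
from typing import Dict, List

# Hypothetical catalog of weight-complexity variants per lifecycle phase.
VARIANTS: Dict[str, List[str]] = {
    "understand": ["understand-small", "understand-large"],
    "think":      ["think-medium", "think-large", "think-xl"],
    "speak":      ["speak-small", "speak-large"],
}

def next_phase_complexity_demand(context: List[str]) -> float:
    """Stand-in for the cheap classifier: context => scalar in [0, 1].
    A real one would presumably be a small trained model scoring how much
    NLP / instruction-following complexity the upcoming phase has to handle."""
    return min(1.0, len(set(context)) / 200)

def route(phase: str, context: List[str]) -> str:
    demand = next_phase_complexity_demand(context)
    variants = VARIANTS[phase]
    # Map the scalar demand onto the cheapest variant that should suffice.
    return variants[min(int(demand * len(variants)), len(variants) - 1)]

# A short, plain prompt gets the cheap variant for every phase.
ctx = "please summarize this paragraph for me".split()
print({phase: route(phase, ctx) for phase in VARIANTS})
```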

Which is great, because it means that even when using the "thinking" model, most people on most requests are only holding a reservation on a GPU loaded with the (probably still hundreds-of-billions-of-weights) high-complexity variant of the thinking-phase sub-model for the limited part of the response-generation lifecycle where the thinking phase is actually occurring. During the "understanding" and "responding" phases, that reservation can be released for someone else to use! And for the vast majority of requests, the "thinking" phase is the shortest phase, so users end up spending most of each request's wall-clock time waiting on the "understanding" and "responding" phases before they can trigger another inference request. Which brings the per-user duty cycle of thinking-phase sub-model use way down.
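
Back-of-the-envelope, with completely made-up numbers, the duty-cycle argument looks like this:

```python
# Hypothetical wall-clock time per phase, consistent with the assumption above
# that the thinking phase is usually the shortest one.
phase_seconds = {"understand": 2.0, "think": 1.5, "speak": 6.0}
# Time the user spends reading the response and typing the next prompt.
idle_seconds = 50.0

cycle = sum(phase_seconds.values()) + idle_seconds
thinking_duty_cycle = phase_seconds["think"] / cycle
monolithic_duty_cycle = sum(phase_seconds.values()) / cycle

print(f"thinking-phase GPU reservation: {thinking_duty_cycle:.1%} of the cycle")
print(f"monolithic thinking model:      {monolithic_duty_cycle:.1%} of the cycle")
# -> roughly 2.5% vs. 16% per user here, i.e. one resident copy of the big
#    thinking-phase weights can serve several times more concurrent users.
```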