Isn't that essentially how the MoE models already work? Besides, if that were infinitely scalable, wouldn't we have a subset of super-smart models already at very high cost?
Also, this would only apply to very few use cases. For a lot of basic customer care work, programming, and quick research, I'd say LLMs are already quite good without running them at 100x the compute.
> if that were infinitely scalable, wouldn't we have a subset of super-smart models already at very high cost
The compute/intelligence curve is not a straight line. It's more likely a curve that saturates, somewhere around 70% of human intelligence. More compute still means more intelligence, but you'll never reach 100%: it saturates well below human level.
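A toy way to picture that claim (the numbers and the exponential form are made up, nothing empirical): a saturating curve where extra compute always helps a little, but the asymptote sits below human level.

```python
# Illustrative only: a saturating compute -> "intelligence" curve with an
# arbitrary ceiling at 0.7 (i.e. 70% of "human level") and arbitrary scale.
import math

def intelligence(compute, ceiling=0.7, scale=1.0):
    # Asymptotically approaches `ceiling` as compute grows; never exceeds it.
    return ceiling * (1 - math.exp(-compute / scale))

for c in [1, 10, 100, 1000]:
    print(c, round(intelligence(c), 3))
# 1 -> 0.442, 10 -> 0.7, 100 -> 0.7, 1000 -> 0.7: diminishing returns toward the ceiling
```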
MoE is something different - it's a technique to activate just a small subset of parameters during inference.
Whatever is good enough now can be much better for the same cost (time, compute, actual dollars). People will always choose better over worse.
MoE models are pretty poorly named, since all the "experts" have the same architecture. They're probably better described as "sparse activation" models. "MoE" implies some sort of heterogeneous experts that a "thalamus router" is trained to use, but that's not how they work.
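To make that concrete, here's a rough PyTorch sketch of what top-k routing over identical feed-forward "experts" typically looks like (layer sizes, names, and the loop-based dispatch are all just for illustration, not any particular model's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        # Homogeneous "experts": identical feed-forward blocks, differing only in learned weights.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The "router" is just a linear layer scoring experts per token.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.gate(x)                  # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)
        out = torch.zeros_like(x)
        # Only the k selected experts run for each token -> sparse activation.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(4, 512)
print(SparseMoELayer()(x).shape)  # torch.Size([4, 512])
```

The gate is just a linear scorer; each token passes through only k of the n identical blocks, which is where the inference savings come from. There's nothing heterogeneous about the experts themselves.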