I think there are also laws of physics based on the current architecture. It's like looking at a 10GB video file and saying - it has to compress to 500MB, right? I mean, it has to - right?
Unless we invent a completely NEW way of doing video, there's no way you can get that kind of efficiency. If tomorrow we're using quantum pixels (or something), sure, 500MB is achievable - but not with existing methods.
In other words, you cannot compress a 100GB gguf file down to 5GB.
There surely are limits, but I don't think we have a good idea of what those are, and there's nothing to indicate we're anywhere close to them. In terms of raw facts, you can look at the information content and know that you need at least that many bits to represent that knowledge in the model. Intelligence/reasoning is a lot less clear.
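The "raw facts" point has a concrete form: Shannon entropy gives a lower bound on how far any codec can shrink data, no matter how clever. A minimal stdlib-only sketch (the byte-level entropy here assumes a memoryless model, so it's illustrative, not a tight bound for real model weights):

```python
import math
import random
import zlib
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy in bits per byte: a lower bound (under a
    memoryless byte model) on lossless compression of this data."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Highly repetitive data: low entropy, compresses very well.
repetitive = b"abab" * 10_000
# Pseudo-random data: near 8 bits/byte, essentially incompressible.
random.seed(0)
noisy = bytes(random.randrange(256) for _ in range(40_000))

for name, data in [("repetitive", repetitive), ("noisy", noisy)]:
    h = entropy_bits_per_byte(data)
    ratio = len(data) / len(zlib.compress(data, 9))
    print(f"{name}: {h:.2f} bits/byte, zlib ratio {ratio:.1f}x")
```

The repetitive stream sits at 1 bit/byte and zlib crushes it; the random stream sits near 8 bits/byte and zlib can't do anything with it. If a model's weights genuinely encode that much irreducible information, no quantization scheme gets below the bound without losing something.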
100GB to 5GB would be 20x. Video has seen an improvement of that magnitude since the days of MPEG-1.
It's interesting to consider that improvements in video codecs have come from both research and massively increased computing power, basically trading space for computation. LLM inference is mostly constrained by memory bandwidth, so if there were some equivalent technique to trade space for computation in LLM inference, that would be a nice win.
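One such trade already exists in practice: quantization. A minimal sketch (my own toy example, not how any real runtime implements it) of per-row int8 quantization, where weights are stored at 1 byte each and dequantized on the fly during the matmul - fewer bytes moved from memory, a bit more arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)  # full-precision weights
x = rng.standard_normal(512).astype(np.float32)         # activation vector

# Quantize: one fp32 scale per row, weights stored as int8.
scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_q = np.round(W / scales).astype(np.int8)

# Inference: dequantize on the fly while computing W @ x.
y_q = (W_q.astype(np.float32) * scales) @ x
y_ref = W @ x

bytes_full = W.nbytes                     # 4 bytes per weight
bytes_quant = W_q.nbytes + scales.nbytes  # ~1 byte per weight + scales
print(f"memory: {bytes_full} -> {bytes_quant} bytes "
      f"({bytes_full / bytes_quant:.1f}x smaller)")
print(f"max abs error vs fp32: {np.abs(y_q - y_ref).max():.3f}")
```

That's roughly a 4x reduction in bytes read per weight for a small accuracy cost, which is exactly why bandwidth-bound inference gets faster at lower precision. Going much further (the 20x of the video analogy) is where the information-content limits start to bite.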