But that's why you shouldn't expect local models to provide quick real-time answers, at least not with the same smarts as SOTA models running in the cloud. Slow, batched inference (where it's feasible at all; RAM capacity is obviously a challenge with typical models on end-user hardware) can be a lot more effective.
My point is that it is WAY more efficient to put the world's DRAM supply into a shared inference pool than to strand it in local machines, where it can never reach the same batch sizes or utilization.
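To make the utilization point concrete, here's a rough back-of-envelope sketch of why batching wins when decode is memory-bandwidth bound: every decode step has to stream the full weights, so a bigger batch amortizes that read across more requests. All numbers below (model size, KV-cache size, bandwidth) are illustrative assumptions, not measurements:

    # Rough decode-throughput model: bandwidth-bound, weights streamed
    # once per step, plus a per-request KV-cache read. Numbers are
    # assumptions for illustration only.
    WEIGHT_BYTES = 14e9        # assumed: ~7B params at fp16
    KV_BYTES_PER_REQ = 1e9     # assumed: KV cache read per request per step
    BANDWIDTH = 400e9          # assumed: memory bandwidth in bytes/sec

    def tokens_per_second(batch_size: int) -> float:
        """Total tokens/sec across the whole batch for one decode step."""
        bytes_per_step = WEIGHT_BYTES + KV_BYTES_PER_REQ * batch_size
        return (BANDWIDTH / bytes_per_step) * batch_size

    for b in (1, 8, 64):
        print(f"batch {b:>2}: ~{tokens_per_second(b):,.0f} tok/s total")

Under those assumed numbers, batch 1 gets roughly 27 tok/s while batch 64 gets roughly 330 tok/s out of the same memory bandwidth. That gap is what a shared pool can exploit and a single-user local machine mostly can't.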
The price of that inefficiency is even higher DRAM costs than we have now, given supply and demand.