Until there is some drastic new hardware, we are going to see a situation similar to proof of work, where a small group hoards the hardware and can collude on prices.
The difference is that current prices are heavily subsidized by OPM (other people's money).
Once the narrative changes to something more realistic, I can see prices increasing across the board. Forget $200/month for codex pro; expect $1000/month or something similar.
So it's a race between new hardware supply (with new paradigm shifts) hitting the market and the tide going out in the financial markets.
Doubtful. Local models are the competitive future that will keep prices down.
128GB is all you need.
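For a rough sense of what 128GB buys, here is a back-of-the-envelope sketch. It only counts the weights (parameters times bits per weight) and ignores the KV cache, activations, and the OS. The function names and the 80% headroom figure are my own assumptions for illustration, not anything standard.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 10^9 bytes)."""
    # params_billions * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billions * bits_per_weight / 8


def fits_in_memory(params_billions: float, bits_per_weight: float,
                   memory_gb: float = 128.0, headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of total memory.

    The 0.8 headroom is an assumed allowance for KV cache, OS, and so on.
    """
    return weight_footprint_gb(params_billions, bits_per_weight) <= memory_gb * headroom


if __name__ == "__main__":
    for params, bits in [(70, 16), (70, 4), (200, 4)]:
        size = weight_footprint_gb(params, bits)
        print(f"{params}B @ {bits}-bit: {size:.0f} GB, fits in 128GB: {fits_in_memory(params, bits)}")
```

Under these assumptions a 70B model at 16-bit (140 GB) does not fit, while the same model at 4-bit (35 GB) fits comfortably, which is why quantization matters so much for the local-model argument.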
After a few more generations of hardware and open models, people will be pretty happy doing whatever they need locally on their laptops, with big SOTA models left for special purposes. There will be a pretty big bubble burst when there aren't enough customers at the $1000/month per seat needed to sustain the enormous datacenter models.
Apple will win this battle and nvidia will be second when their goals shift to workstations instead of servers.
> Until there is some drastic new hardware
For inference, a 10x improvement over a setup based on NVIDIA server GPUs is already possible, but volume production and the rest of the supply chain will take a while to catch up.
During inference the model weights are static, so they can be stored in High Bandwidth Flash (HBF) instead of High Bandwidth Memory (HBM). Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.
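To see why memory bandwidth is the thing to watch here, a common rule of thumb is that single-stream decoding of a dense model has to read every weight once per token, so tokens per second is capped at roughly bandwidth divided by weight size. The sketch below just does that division. The bandwidth numbers are made-up placeholders, not specs for HBF, HBM, or any real part, and the estimate ignores compute, KV cache traffic, batching, and MoE sparsity.

```python
def decode_tokens_per_second(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper-bound estimate for batch-size-1 decoding of a dense model.

    Assumes every weight is read once per generated token and that memory
    bandwidth is the only bottleneck (a simplification).
    """
    if weights_gb <= 0 or bandwidth_gb_per_s <= 0:
        raise ValueError("weights_gb and bandwidth_gb_per_s must be positive")
    return bandwidth_gb_per_s / weights_gb


if __name__ == "__main__":
    weights_gb = 35.0  # e.g. a 70B model at 4 bits per weight
    # Hypothetical bandwidth figures, for illustration only.
    for label, bandwidth in [("placeholder A", 500.0), ("placeholder B", 1000.0)]:
        rate = decode_tokens_per_second(weights_gb, bandwidth)
        print(f"{label}: {bandwidth:.0f} GB/s -> about {rate:.1f} tokens/s")
```

The point is only that the ceiling scales linearly with read bandwidth and inversely with model size; since the weights never change during inference, write speed and endurance barely enter the picture.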
NVIDIA GPUs are general purpose. Sure, they have "tensor cores", but that's a fraction of the die area. Google's TPUs are much more efficient for inference because they're mostly tensor cores by area, which is why Gemini's pricing is undercutting everybody else despite being a frontier model.
New silicon process nodes are coming from TSMC, Intel, and Samsung that should roughly double the transistor density.
There are also algorithmic improvements, like the recently announced Google TurboQuant.
Not to mention that pure inference doesn't need the crazy fast networking that training does, or the storage, or pretty much anything other than the tensor units and a relatively small host server that can send a bit of text back and forth.