Sure, but where is the demand going to come from? LLMs are already in every Google search, in WhatsApp/Messenger, throughout Google Workspace, Notion, Slack, etc. ChatGPT already has a billion users.
Plus, penetration is already very high in the areas where they are objectively useful: programming, customer care, etc. I just don't see where the 100-1000x demand comes from to offset this. Would be happy to hear other views.
As plenty of others have mentioned here, if inference were 100x cheaper, I would run 200x inference.
There are so many things you can do with long running, continuous inference.
We are nearly infinitely far away from saturating compute demand for inference.
Case in point: I'd like something that assesses, in real time, all the sensors and API endpoints of the stuff in my home and, as needed, bubbles up summaries, diaries, and emergency alerts. Right now that's probably a single H200, and well out of my "value range". The number of people in the world who do this at scale today is almost certainly less than 50k.
If that inference cost dropped to 1% of what it is now, then a) I'd be willing to pay it, b) there'd be enough of a market that a company could make money integrating a bunch of tech into a simple deployable stack, and therefore c) a lot more people would want it, likely enough to drive more than 50k H200s' worth of inference demand.
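Conceptually it's just a long-running loop. A rough sketch in Python, with heavy caveats: the sensor helper, endpoint URL, and model name below are placeholders I'm making up, and I'm assuming a locally served model behind any OpenAI-compatible chat completions API (e.g. vLLM or llama.cpp's server), not any particular product.

```python
# Sketch of the "home observer" loop described above.
# Assumptions (all hypothetical): read_all_sensors() is something you'd implement
# against your own devices; the model sits behind an OpenAI-compatible endpoint.
import json
import time
import requests

LLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local server

def read_all_sensors() -> dict:
    """Placeholder: poll cameras, thermostats, door sensors, smart-plug APIs, etc."""
    return {"ts": time.time(), "front_door": "closed", "living_room_temp_c": 21.4}

def ask_model(prompt: str) -> str:
    resp = requests.post(LLM_URL, json={
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }, timeout=60)
    return resp.json()["choices"][0]["message"]["content"]

while True:
    snapshot = read_all_sensors()
    verdict = ask_model(
        "Here is the latest home telemetry as JSON:\n"
        f"{json.dumps(snapshot)}\n"
        "Reply with one line: 'OK', 'SUMMARY: <text>' if something is worth noting, "
        "or 'ALERT: <text>' if it needs immediate attention."
    )
    if verdict.startswith("ALERT"):
        print(verdict)   # in practice: push notification
    elif verdict.startswith("SUMMARY"):
        print(verdict)   # in practice: append to a daily diary
    time.sleep(30)       # continuous, long-running inference
```

The point isn't the code; it's that every pass through that loop is a paid inference call, forever, for every household that wants it.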
> Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc.
Is that true? The BLS estimate of customer service reps in the US is 2.8M (https://www.bls.gov/oes/2023/may/oes434051.htm), and while I'll grant that's from 2023, I would wager a lot that the number is still above 2M. Similarly, the overwhelming majority of software developers haven't lost their jobs to AI.
A sufficiently advanced LLM will be able to replace most, if not all, of those people. Penetration into those areas is very low right now relative to where it could be.
We've seen several orders-of-magnitude improvements in CPUs over the years, yet try to do anything now and the interaction is often slower than it was on a ZX Spectrum. We can easily absorb another order-of-magnitude improvement, and that's only going to create more demand. We can/will have models thinking for us all the time, in parallel, bothering us only with findings/final solutions. There is really no limit here.
I'm already throughput-capped on my output via Claude. If you gave me 10x the tokens/s, I'd ship at least twice as much value (at good-enough-for-the-business quality, to be clear).
There are plenty of use cases where the models are not smart enough to solve the problem yet, but there is very obviously a lot of value to be harvested from maturing and scaling out just the models we already have.
Concretely, the $200/mo and $2k/mo offerings will be adopted by more prosumer and professional users as the product experience matures.
The difference in usefulness between ChatGPT free and ChatGPT Pro is significant. Turning up compute for each embedded usage of LLM inference will be a valid path forward for years.
The problem is that unless you have efficiency improvements that radically alter the shape of the compute vs smartness curve, more efficient compute translates to much smarter compute at worse efficiency.
If you can make an LLM solve a problem from 100 different angles at the same time, that's worth something.
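To make that concrete, "100 different angles" is basically best-of-n sampling: fan out n independent high-temperature attempts in parallel, then spend one more call picking a winner. A minimal sketch, again assuming a hypothetical OpenAI-compatible local endpoint (URL and model name are made up):

```python
# Best-of-n sketch: n independent sampled attempts, one judging pass to pick the best.
from concurrent.futures import ThreadPoolExecutor
import requests

LLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local server

def ask_model(prompt: str, temperature: float = 1.0) -> str:
    resp = requests.post(LLM_URL, json={
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }, timeout=120)
    return resp.json()["choices"][0]["message"]["content"]

def solve_from_many_angles(problem: str, n: int = 100) -> str:
    # n independent, high-temperature samples: each one is "a different angle".
    prompts = [f"Attempt {i + 1} of {n}. Try an approach you haven't tried before.\n\n{problem}"
               for i in range(n)]
    with ThreadPoolExecutor(max_workers=min(n, 32)) as pool:
        candidates = list(pool.map(ask_model, prompts))
    # One low-temperature judging pass to select a winner.
    ballot = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    pick = ask_model(
        f"Problem:\n{problem}\n\nCandidate solutions:\n{ballot}\n\n"
        "Reply with only the index of the best candidate.",
        temperature=0.0,
    )
    return candidates[int(pick.strip().strip("[]"))]
```

Every one of those n attempts is inference you'd happily pay for if it were cheap enough, which is exactly where the extra demand goes.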
I mean, 640KB should be enough for anyone too, but here we are. Assuming LLMs fulfill the expected vision, they will be in everything and everywhere. Think about how much the internet has permeated everyday life. Even my freaking toothbrush has WiFi now! 1000x is likely several orders of magnitude too low in terms of potential demand (again, assuming LLMs deliver on the promise).
Long-running agents?
If LLMs were next to free and faster, I would personally increase my consumption 100x or more, and I'm only in the "programming" category.