They will be, and that moment is not that far off. We've already got the progression in place: first, only large data centers could run performant LLMs; we are now firmly in "a bunch of servers with a couple of H100s each" territory, slowly heading toward "128 GB VRAM on a MacBook Pro or a Strix Halo". Within the next year, the pattern of "expensive remote LLM for planning, local slow-but-faster-than-human LLM for execution" will become the norm for companies, slowly giving way to "using a local LLM for everything is good enough". And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed. The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to and what that means for the market.
You are greatly underestimating the hardware requirements for productive local LLMs. Research consistently shows that parameter count sets the practical ceiling on a model's reliability. Quantized models with double-digit parameter counts will never be reliable enough to achieve results in the realm of something like Opus 4.6.
I think it's inevitable that access to good enough LLM models will be democratised.
However, that's not the real battle here. The real battle is control of the information models operate over.
While I might have access to a decent model, I don't have the huge integrated databases of everything that companies like Google have, and that governments will increasingly accumulate.
As a citizen, AI operating over these large datasets is where the concern should be.
How fast do you reckon most people will be able to afford 128-256GB of RAM?
Do you think small models will arrive? I mean, if I need to write a web application in TypeScript, why should I use a model that knows every programming language and can answer questions about almost anything? I just need a small, performant model that knows how to write web applications in TypeScript. That would be very helpful and easy to run on my laptop.
> how much of the current compute capacity craze will local hosting give the kiss of death to and what that means for the market.
Nvidia and other hardware sellers would love to sell a bunch of chips to individual consumers that would sit idle for 95% of their lives.
> The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to and what that means for the market.
This will depend on how much inference happens for consumers (desktop, local) vs enterprise ("cloud") vs consumer mobile (probably also cloud).
I would assume that the proportion of "consumer, local" is small relative to enterprise and mobile.
The biggest impact of local models may simply be that they prevent remote inference from becoming the only game in town.
Certainly, I don't think data centers are the way here.
I'd guess it'll most likely end up as AI doing the processing and everything else becoming an API.
In the case of the GPTs and Claudes of the world, they'll just be using indexing APIs and knowledge bases on top of their LLMs.
Except you will want the frontier to compete. Local models are useful, but you will always need $$$ to stay within the same order of magnitude as the frontier, and also $$$ for the same token speed.
The question is: would you choose to save $10 a day if it slows your inference down 10x and wastes 2 hours a day waiting on stuff?
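To put a number on that trade-off (using only the figures above): saving $10 a day at the cost of 2 lost hours a day only pays off below a certain hourly value of your time.

```python
# Break-even for the trade-off above: saving $10/day at the cost of
# 2 extra hours/day of waiting on slow local inference.
savings_per_day = 10      # USD, from the comment above
hours_lost_per_day = 2

breakeven_hourly_rate = savings_per_day / hours_lost_per_day
print(f"worth it only if your time is worth under ${breakeven_hourly_rate:.0f}/hour")
# -> $5/hour
```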
This is simply delusional. It costs $20-30k a month to run Kimi 2.6, and the tokens are sold for $3 per million.
To sell tokens profitably, you'd need to be able to run inference at 150 tokens per second for less than $1,000 USD a month.
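A quick sanity check on that break-even figure, assuming perfect 24/7 utilization and the $3-per-million price from above:

```python
# Back-of-envelope: monthly revenue from selling tokens at 150 tok/s,
# assuming 24/7 utilization and $3 per million tokens (figures from above).
tokens_per_second = 150
seconds_per_month = 60 * 60 * 24 * 30          # ~30-day month
price_per_million = 3.0                        # USD

tokens_per_month = tokens_per_second * seconds_per_month
revenue = tokens_per_month / 1_000_000 * price_per_million

print(f"{tokens_per_month / 1e6:.0f}M tokens/month -> ${revenue:.0f}/month")
# -> roughly 389M tokens/month, ~$1,166/month in revenue, so the hardware
#    has to cost well under that per month to turn a profit
```

Any downtime or idle capacity only pushes the real break-even lower than $1,000.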
I don't think people realize how expensive it is to host decently capable models and how much their use of capable models is subsidized.
You can only squeeze so many parameters onto consumer-grade hardware that's actually affordable. Two 4090s are not consumer grade, and neither is a 128 GB MacBook; that's incredibly expensive for the average person. And the models you can still run are not "good enough"; they're essentially useless.
People are betting their competence on a future where billionaires stay generous forever, subsidizing inference at a 10:1 or 20:1 loss ratio. Guess what: that WILL end, and probably soon. The idea that companies can afford to give you access to $2M in GPUs for 5 hours a day at $200.00 a month is simply unsustainable.
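A rough sketch of that gap, under assumptions that are mine, not the commenter's: 4-year straight-line depreciation, perfect time-sharing of the rig, power and ops ignored.

```python
# Rough sanity check on the subsidy claim above. Assumed: 4-year
# straight-line depreciation, perfect time-sharing, power/ops ignored.
gpu_capex = 2_000_000            # USD, "$2M in GPUs" from the comment
hours_per_user_per_day = 5       # usage pattern from the comment
subscription = 200               # USD/month, price from the comment

monthly_capex = gpu_capex / (4 * 12)             # ~$41,667/month
users_per_rig = 24 / hours_per_user_per_day      # at most ~4.8 users share it
cost_per_user = monthly_capex / users_per_rig    # ~$8,681/month

print(f"cost per user ~${cost_per_user:,.0f}/month vs ${subscription} charged")
```

Even under these generous assumptions the gap is tens-to-one; batching many users onto shared capacity is what makes the real numbers less extreme, but the direction holds.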
Right now they are trying to get you hooked. DON'T FALL FOR IT. Study, work hard, sweat, and you'll reap the benefits. The guy making handmade watches, one a month, in Switzerland earns a whole lot more than the guy running a manufacturing line making 50k in China. Just write your own fkin code, people.
Don't bet your future on having access to some billionaire's thinking machine. Intelligence, knowledge and competency aren't fungible; the LLM hype is a lie to convince you that they are.
> They will be, and that moment is not that far off.
It's here, right now. I'm running quantized Qwen and Gemma on a decent but three-year-old gaming rig (think RTX 3080 12GB and 32 GB RAM). Yes, it's slow and has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spending. It can answer simple questions, analyze code and even write code when little context is required. I could probably get a half-decent autocomplete out of it if I bothered with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think.
> And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed.
Currently, it works exactly the other way around. The cloud versions are orders of magnitude cheaper than self-hosting, because sharing utilizes servers much more efficiently. A company can spend half a million bucks on a rig running GLM 5.1 and get data security, flexibility and freedom from censorship, but oh, it's so expensive compared to Anthropic's per-seat plans.