
babblingfish today at 4:40 AM

LLMs on device are the future. It's more secure, it solves the problem of too much demand for inference compared to data center supply, and it would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier model performance.


Replies

konschubert today at 10:41 AM

I disagree with every sentence of this.

> solves the problem of too much demand for inference

False: it creates consumer demand for inference chips, which will be badly utilised.

> also would use less electricity

What makes you think that? (MAYBE you can save power on cooling. But not if the data center is close to a natural heat sink)

> It's just a matter of getting the performance good enough.

The performance limitations are inherent to the limited compute and memory.

> Most users don't need frontier model performance.

What makes you think that?

troad today at 6:45 AM

I very recently installed llama.cpp on my consumer-grade M4 MBP, and I've been having loads of fun poking and prodding the local models. There's now a ChatGPT style interface baked into llama.cpp, which is very handy for quick experimentation. (I'm not entirely sure what Ollama would get me that llama.cpp doesn't, happy to hear suggestions!)

There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)
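
If you'd rather script against a local model than drive the web UI, the Python bindings work too. A minimal sketch (the model filename is just an assumption; point it at whatever GGUF you've downloaded):

    # minimal local-inference sketch with llama-cpp-python (pip install llama-cpp-python)
    from llama_cpp import Llama

    # model_path is a placeholder -- use whatever GGUF you have on disk
    llm = Llama(model_path="./qwen-9b-q4_k_m.gguf", n_ctx=8192)

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain mmap in two sentences."}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])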

melvinroest today at 5:16 AM

I have journaled digitally for the last 5 years with this expectation.

Recently I built a graphRAG app, using Qwen 3.5 4b for small tasks like classifying what type of question I'm asking, and for the entity extraction itself, since graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.

It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.

I used MLX on my 64GB M1 machine. MLX is definitely faster when extracting entities and triplets in batches.
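
For anyone curious, the extraction step is roughly this (a sketch with mlx-lm; the model id and prompt wording are my own illustrative choices, not exactly what I ran):

    # sketch of the triplet-extraction step with mlx-lm (pip install mlx-lm)
    from mlx_lm import load, generate

    # model id is a placeholder -- substitute the Qwen build you actually use
    model, tokenizer = load("mlx-community/Qwen2.5-3B-Instruct-4bit")

    prompt = (
        "Extract (entity1, relationship_to, entity2) triplets from the text.\n"
        "Return one triplet per line.\n\n"
        "Text: Alice moved to Berlin in 2021 and joined a small startup."
    )
    print(generate(model, tokenizer, prompt=prompt, max_tokens=200))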

jonhohle today at 1:54 PM

I’ve been using Google Search AI and Gemini, which I find generally pretty good. In the past week, though, Gemini and Search AI have been pulling in details from previous searches I’ve done and Search AI conversations I’ve had, and it’s extremely gross and creepy.

I was looking for details about cars, and it started interjecting how the safety would affect my children, by name, in a conversation where I never mentioned my children. I was asking about Thunderbolt and modern Ryzen processors, and a fresh Gemini chat brought in details about a completely unrelated project I work on. I’ve always thought local LLMs would be important, but whatever Google did in the past few weeks has made that even clearer.

Aurornis today at 2:19 PM

> solves the problem of too much demand for inference compared to data center supply

Maybe in the distant future when device compute capacity has increased by multiples and efficiency improvements have made smaller LLMs better.

The current data center buildouts are using GPU clusters and hybrid compute servers that are so much more powerful than anything you can run at home that they’re not in the same league. Even among the open models that you can run at home if you’re willing to spend $40K on hardware, the prefill and token generation speeds are so slow compared to SOTA served models that you really have to be dedicated to avoiding the cloud to run these.

We won’t be in a data center crunch forever. I would not be surprised if we have a period of data center oversupply after this rush to build out capacity.

However at the current rate of progress I don’t see local compute catching up to hosted models in quality and usability (speed) before data center capacity catches up to demand. This is coming from someone who spends more than is reasonable on local compute hardware.

AugSun today at 5:05 AM

"Most users don't need frontier model performance" unfortunately, this is not the case.

karimf today at 6:45 AM

Depending on the use case, the future is already here.

For example, last week I built a real-time voice AI running locally on an iPhone 15.

One use case is helping people learn to speak English. The STT is quite good, and the small LLM is enough for basic conversation.

https://github.com/fikrikarim/volocal

babblingfish today at 5:17 PM

I see a lot of people are confused by the electricity claim, so I'll elaborate. The assumption I'm making is that, on device, people will run smaller models that fit on the machines they already own, without buying new computers. If everyone ran inference on their own machine, there would be no need for these massive datacenters, which use huge quantities of electricity. It would use the hardware people already have and the electricity they're already using.

People are comparing the cost per inference or per token and saying datacenters are more efficient, which makes obvious sense. What I'm saying is that if we eliminate the need to build dozens of gigawatt datacenters entirely, we would use less electricity overall. I feel like this makes intuitive sense; people are getting lost in the details of cost per inference and performance of different models.

ZeroGravitas today at 7:11 AM

It feels like you'll soon need a local LLM to intermediate with the remote LLM, like an ad blocker for browsers: something to stop them from injecting ads, or to remind you not to send corporate IP out onto the internet.

jl6 today at 6:59 AM

Not sure about the using less electricity part. With batching, it’s more efficient to serve multiple users simultaneously.

eeixl today at 1:14 PM

Obviously Apple would prefer this. It would boost demand for more powerful and expensive devices, and it aligns with their privacy marketing. But they massively fumbled Siri for a long time and then missed huge deadlines on their AI promises. Despite having billions, they have shown no competence in delivering services or accurately marketing what to expect from AI features.

nbenitezl today at 9:51 AM

But in the cloud, an LLM can consult 50 websites very quickly, since the datacenters sit close to the internet backbone. On your device, you'll wait much longer for those websites to be fetched before the LLM gives you its response. Am I wrong?

thih9 today at 7:37 AM

> it also would use less electricity

How would it use less electricity? I’d like to learn more.

pezgrande today at 5:17 AM

You could argue that the only reason we have good open-weight models is that companies are trying to undermine the big dogs, spending millions to make sure they don't get too far ahead. If the bubble pops, there won't be an incentive to keep doing it.

adam_patarino today at 12:51 PM

We think so too! That's why we are building rig.ai. Given how token-intensive coding tasks can be, local allows for unlimited inference: a much better fit than sending everything back and forth to a third party. Not to mention the privacy and security benefits.

g947o today at 12:35 PM

Have you spent more than 10 minutes actually running an LLM on a local machine?

As it stands today, local LLMs don't work remotely as well as some people picture, in almost every way: speed, performance, cost, usability, etc. The only upside is privacy.

zozbot234 today at 9:00 AM

> Most users don't need frontier model performance.

SSD weights offload makes it feasible to run SOTA local models on consumer or prosumer/enthusiast-class platforms, though with very low throughput (the SSD offload bandwidth is a huge bottleneck, mitigated by having a lot of RAM for caching). But if you only need SOTA performance rarely and can wait for the answer, it becomes a great option.
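
Back-of-envelope on the throughput, with made-up but plausible numbers:

    # illustrative arithmetic, not a benchmark -- both numbers are assumptions
    weights_gb = 400     # assume ~400 GB of quantized weights for a large dense model
    ssd_gb_per_s = 7     # assume a fast PCIe 4.0 NVMe drive
    # a dense model reads every weight once per decoded token
    print(f"~{weights_gb / ssd_gb_per_s:.0f} s/token")  # ~57 s/token
    # MoE models only touch the active experts per token, and RAM caching of hot
    # weights helps a lot, which is why this is merely slow rather than unusable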

iNic today at 9:03 AM

It will probably be a future. My guess is that for many businesses it will still make sense to have more powerful models and to run them centralized in a datacenter. Also, by batching queries you can get efficiencies at scale that might be hard to replicate locally. I can also see a hybrid approach where local models get good at handing off to cloud models for complex queries.

miki123211 today at 7:30 AM

> would use less electricity

Sorry to shatter your bubble, but this is patently false: LLMs are far more efficient on hardware that serves many requests at once.

There's also the (environmental and monetary) cost of producing overpowered devices that sit idle when you're not using them, in contrast to a cloud GPU, which can be rented out to whoever needs it at a given moment, potentially at a lower cost during periods of lower demand.

Many LLM workloads aren't even that latency sensitive, so it's far easier to move them closer to renewable energy than to move that energy closer to you.
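
The batching effect is easy to demonstrate: decode is bottlenecked on reading the weights from memory, and a batched forward pass reads them once for the whole batch. A toy sketch (numpy as a stand-in for one weight matrix; sizes are illustrative):

    # toy demo: per-sequence cost falls as batch size grows, because the
    # weight matrix is read from memory once per call regardless of batch
    import time
    import numpy as np

    d = 4096
    W = np.random.randn(d, d).astype(np.float32)  # stand-in for one layer's weights

    for batch in (1, 8, 32):
        x = np.random.randn(batch, d).astype(np.float32)
        t0 = time.perf_counter()
        for _ in range(50):
            _ = x @ W
        dt = time.perf_counter() - t0
        print(f"batch={batch:3d}: {dt / (50 * batch) * 1e6:8.1f} us per sequence")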

goldenarm today at 9:06 AM

It's more secure, but it would make supply much, much worse.

Data centers use GPU batching, much higher utilisation rates, and more efficient hardware. They're borderline two orders of magnitude more efficient than your desktop.

amelius today at 8:27 AM

LLMs in silicon are the future. It won't be long until you can just plug an LLM chip into your computer and talk to it at 100x the speed of current LLMs. Capability will be lower, but the speed will make up for it.

dwayne_dibley today at 10:29 AM

This might be how Apple starts to see even more sales: the M-series processors are so far ahead of anything else that local LLMs could become their main selling point.

overfeed today at 6:51 AM

> It's just a matter of getting the performance good enough.

Who will pay for the ongoing development of (near-)SoTA local models? The good open-weight models are all developed by for-profit companies - you know how that story will end.

aurareturn today at 5:03 AM

It isn't going to replace cloud LLMs since cloud LLMs will always be faster in throughput and smarter. Cloud and local LLMs will grow together, not replace each other.

I'm not convinced that local LLMs use less electricity either. Per token at the same level of intelligence, cloud LLMs should run circles around local LLMs in efficiency. If it doesn't, what are we paying hundreds of billions of dollars for?

I think local LLMs will continue to grow, and there will be a "ChatGPT moment" for them when good-enough models meet good-enough hardware. We're not there yet though.

Note, this is why I'm big on investing in chip manufacturing companies. Not only are they completely maxed out due to cloud LLMs, but soon they will be doubly maxed out replacing local computer chips with ones suited to AI inference. This is a massive transition and will fuel another chip manufacturing boom.

gedy today at 4:53 AM

Man, I really hope so. As much as I like Claude Code, I hate the company paying for it and tracking your usage, the bullshit management controls, etc. I feel like I'm training my replacement. Things feel like they're tightening rather than moving toward more power and freedom.

On device, I would gladly pay for good hardware: it's my machine and I'm using it as I see fit, like an IDE.

nikanj today at 6:58 AM

That also means sending every user a copy of a model you spent billions training. The current approach (running the models on the vendor's side) makes it much easier to protect that investment.
