Hacker News

Local AI needs to be the norm

1814 points by cylo last Sunday at 5:19 PM | 725 comments

Comments

TechSquidTV last Sunday at 11:15 PM

Local AI will catch up. Unless we can't get our hands on hardware anymore, which is a legitimate concern I have.

vegabook last Sunday at 8:08 PM

>> years ago I launched "The Brutalist Report"

proceeds to brutalise the reader with an 88-point headline font.

karmasimida yesterday at 3:30 AM

How? Memory prices are sky-high; that is the chokehold the monopoly will not let go of.

1a527dd5 last Sunday at 9:50 PM

Consumer/private needs to be local.

Work? I don't want it local at all. I want it all to be cloud agents.

cl0ckt0wer yesterday at 10:00 AM

If they do then hardware costs will explode even more

eyk19 last Sunday at 8:33 PM

Apple stock is going to skyrocket

anArbitraryOne yesterday at 2:58 AM

Just let me turn it off to preserve battery life

Salgat last Sunday at 10:39 PM

Local models are much less energy efficient, right?

agentifysh last Sunday at 8:07 PM

Until the hardware is economical and powerful enough, local AI that can compete with frontier models today is still far off.

If we could even get something like GPT 5.5 running locally that would be quite useful.

tuananh yesterday at 2:32 AM

A local LLM doesn't need to match SOTA performance in order to be useful.

hypfer last Sunday at 8:09 PM

Same as local compute.

Welcome back to 2014. Let us now continue yelling at the cloud.

osjxjsjxjs yesterday at 3:15 PM

No AI needs to be the norm. Again.

refulgentis last Sunday at 9:22 PM

The shitty thing here is that either everyone's shipping at least 800 MB with their binary, or you have to rely on the platform vendor anyway. I'm hoping there's enough external pressure that the OS vendors turn it into more of a repository than a blessed-model garden.

wilg last Sunday at 8:47 PM

Two issues:

1. Local models are likely to be more power-expensive to run (per "unit of intelligence") than remote models, due to datacenter economies of scale. People do not like to engage with this point, but if you have environmental concerns about AI, this is a pretty important one.

2. Using dumb models for simple tasks seems like a good idea, but it becomes pretty clear pretty quickly that you just want the smartest model you can afford for absolutely every task.

dana321 last Sunday at 8:45 PM

"NO AI" needs to be the norm, we should be working on better ways of sharing information and better documentation instead of fighting with computers for substandard results.

williamtrask last Sunday at 7:47 PM

I wonder if a popularization moment for local AI will ultimately be the pin-prick that pops the AI bubble. Like the deepseek or openclaw moments but bigger/next.

shmerl last Sunday at 8:15 PM

Depending on some remote AI provider is a major lock-in pitfall. But it's exactly what those AI providers want you to do.

ChoGGi last Sunday at 10:32 PM

Who can afford local AI?

unnouinceput yesterday at 2:46 PM

Quote 1: "We need to return to a habit of building software where our local devices do the work."

Quote 2: "I can only speak on the tooling available within the Apple ecosystem since that’s what I focused initial development efforts on."

Oh, the irony. I will use your tooling when it's available on Android via F-Droid; at least then it will be decoupled from the big companies' grip.

tristor yesterday at 1:39 PM

The biggest challenge I have with local models right now (and I use them extensively) is search integration and tool calling. The thing Claude and ChatGPT get right for most general-purpose use cases, and which is hard to do with a local model, is the model deciding when to search versus relying on its built-in training, plus strong search tooling and tool calling for additional data sources via MCP. If you can incorporate the right data into the context window, local models are more than good enough for general-purpose usage as they stand today. Qwen 3.5, Gemma 4, even gpt-oss-120b are solid at reasonable quants if they have the right data.

The moment we see standardized, batteries-included pathways to integrate search, ideally at no additional cost, in things like LM Studio, combined with better tool calling in the local models, you'll quickly see local model performance catch up.
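
A minimal sketch of what that pathway can look like today, assuming LM Studio's local OpenAI-compatible server is running on its default port (1234) and the loaded model supports tool calls; the web_search function and the model name are placeholders for whatever search backend and model you actually wire up:

    # Tool calling against a local OpenAI-compatible endpoint (e.g. LM Studio).
    # Assumes a server at localhost:1234 and a model that supports tools;
    # web_search is a hypothetical stand-in for a real search integration.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    tools = [{
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web and return short result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    def web_search(query: str) -> str:
        # Placeholder: call your real search backend here.
        return json.dumps([{"title": "example result", "snippet": "..."}])

    messages = [{"role": "user", "content": "What changed in the latest llama.cpp release?"}]
    resp = client.chat.completions.create(model="local-model", messages=messages, tools=tools)
    msg = resp.choices[0].message

    # If the model decided to search, run the tool and feed the result back.
    if msg.tool_calls:
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": web_search(**args)})
        resp = client.chat.completions.create(model="local-model", messages=messages, tools=tools)

    print(resp.choices[0].message.content)

The interesting part is the branch: it's the model, not the harness, deciding whether the search tool gets called at all.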

worthless-trash yesterday at 10:41 AM

How long till we have distributed AI, where different people run/understand different parts of a problem and pass off work to different nodes across the internet?

alfiedotwtf yesterday at 6:05 AM

This would be nice, but unfortunately the norm at the moment is: release a rushed model that doesn't work with llama.cpp, but if it does, make sure the chat template is broken. And even if it did have a perfect chat template, let the model loop endlessly, rewriting the same file with the same content for hours on end.

It would be nice if model makers could at minimum embrace test harnesses, and as a stretch goal, if they're going to change underlying formats, at least land compatible readers in the big engines (e.g. llama.cpp and vllm).
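
As a tiny example of the kind of test harness that would catch a broken template before release, here's a sketch that renders a model's chat template, assuming the model ships a Hugging Face tokenizer with a chat template (the model name is a placeholder):

    # Render a chat template and eyeball the output before shipping the model.
    # Assumes a Hugging Face tokenizer with a chat_template; name is a placeholder.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("some-org/some-new-model")

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello."},
    ]

    # Render without tokenizing so a human (or an assertion) can check that the
    # special tokens, role markers, and generation prompt match what the
    # inference engines expect.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(prompt)
    assert prompt.strip(), "chat template produced an empty prompt"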

jmyeet last Sunday at 10:16 PM

I've been looking into options for this and we are getting close. There are two main constraints: memory and memory bandwidth.

NVidia segments the market by limiting the amount of memory on GPUs. It currently tops out at 32GB (on a 5090), but that card has excellent memory bandwidth (~1.8TB/s). If you want more than that, you need to buy an RTX Pro (e.g. RTX 6000 Pro w/ 96GB for ~$10K), or you get into high-end solutions like the H100, H200, etc. that have significantly more memory and even higher bandwidth on HBM memory (e.g. 3.2TB/s+).

NVidia has released the DGX Spark w/ 128GB of memory for ~$4k. The problem is the memory bandwidth: it's only 273GB/s, which is less than the M5 Pro (307GB/s) but more than the M5. You can buy a 16" MacBook Pro with an M5 Max and 128GB of memory for $6k, and it has a bandwidth of 614GB/s. So the DGX Spark is a joke, really.
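
A rough back-of-envelope for why bandwidth is the number that matters: at batch size 1, generating each token means streaming roughly the active model weights through memory, so tokens/s is bounded by bandwidth divided by model size. The model size below is an illustrative assumption (a ~70B-parameter dense model at ~4 bits/weight); MoE models read fewer bytes per token.

    # Upper bound on single-stream decode speed: bandwidth / bytes read per token.
    def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    model_gb = 40  # assumed: ~70B dense params at ~4 bits/weight

    for name, bw in [("DGX Spark", 273), ("M5 Max", 614),
                     ("M3 Ultra", 819), ("RTX 5090", 1800)]:
        print(f"{name:10s} ~{max_tokens_per_sec(bw, model_gb):5.1f} tok/s upper bound")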

In case it wasn't clear, Apple is interesting in this space because it has a shared memory architecture so the GPU can use all the memory.

Many, myself included, expect there to be no refresh of the 5000-series consumer GPUs this year, which would otherwise happen based on product cycles. So no 5080 Super, for example. And I wouldn't expect a 6090 before 2028, realistically.

One thing Apple hasn't done yet is release the M5 Mac Studios, which are widely expected in Q3 this year. They are interesting because, for example, the M3 Ultra has a memory bandwidth of 819GB/s and previously had a max spec of 512GB but that got discontinued (and the 256GB version also got discontinued more recently).

So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.

You really have to use this hardware almost 24x7 for it to be economical, because otherwise H100 compute hours are probably cheaper.

But what happens to the trillions in AI DC investment when the next generation of GPUs comes out? It's going to halve in value. That's over $1 trillion in capex that effectively disappears overnight.

I think Apple is the dark horse here because they have no interest in NVidia's pseudo-monopoly. I'm just waiting for them to realize it.

Now CUDA is an issue here still but I think as time goes on it's going to be less of an issue. Memory is still a huge constraint both in terms of price and just general supply because NVidia can justify paying way more for it than you can, probably.

It's still sad to see that 128GB (2x64GB) DDR5 kits are almost $2k now and were $400 a year ago. Expect that to continue until this bubble pops (which IMHO it will) and we're likely in a global recession.

So the other issue is models. OpenAI and Anthropic are built on proprietary models. Their entire valuation depends on this moat. I don't think this will last, so both companies are doomed, because open source models are going to be sufficiently good.

We can already do some reasonably cool stuff on local hardware that isn't that expensive and even more so once you get to $5-10k hardware. That's going to be so much better in 2 years that I'm hesitant to spend any amount of money now.

Plus the code for running these things is getting better. Just in the last month there have been huge speed ups in local LLMs with MTP.

DoctorOetker yesterday at 2:48 AM

One advantage of local AI is continual learning.

When I say 'moat' I don't mean moat specific to a company vis-a-vis other companies, but 'moat' specific to the set of inference providers vis-a-vis self-hosted local inference.

The moat consists primarily of being able to batch inference requests.

If we pretend people weren't interested in long context lengths, there would be a moat for inference providers, who can batch many requests so that streaming the model weights (whether from system RAM to GPU RAM, or from GPU RAM to GPU SRAM cache) is amortized over multiple requests.

However people do want longer memory than the native context length.

One approach is continual learning (basically continue training by using the past conversation as extra corpus material; interspersed with training on continuations from the frozen model, so it doesn't drift or catastrophically forget knowledge / politeness / ...).

However this is very expensive for inference providers, since they would have to multiply model weight storage by the number of users. For a single user, the memory cost of continual learning is much less, since they only need to support themselves; they recoup some of the memory cost through the elimination of KV caches, and get higher quality answers compared to subquadratic approximations of quadratic attention.

An advantage of continual learning is that the conversation / code base / context is continuously rebaked into the model weights, and so doesn't need KV caches. It doesn't need imperfect approximations to quadratic attention; it attends via its working knowledge being updated.

Nothing prevents local LLM users from implementing this and benefiting from the dropped requirements of KV caches and enjoying true quadratic attention implicitly over the whole codebase, or many overlapping projects indeed.

The only remaining moat of inference providers vis-a-vis continual-learning local LLMs is the batching advantage, plus the gradient update costs for continual learning, minus the KV storage and compute costs, minus the performance loss due to inexact approximations to quadratic attention.

This points towards a stronger incentive for local hosting than currently realized. None of the popular local LLM tools currently support continual learning; once this genie is out of the bottle, it will be a permanent decrease in the inference providers' moat, the cost of which can't be expressed merely in hardware or energy, since it is difficult to quantify the financial loss from inexact approximations to quadratic attention, from the limited effective context length, and from the concomitant loss in result quality.
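
One way to make the per-user memory trade concrete is to compare a long fp16 KV cache against a small per-user adapter standing in for the continual-learning state (the adapter framing and all the architecture numbers here are illustrative assumptions, not a specific model):

    # Per-user memory: long KV cache vs. a small per-user learned adapter.
    layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2   # assumed fp16 K and V

    kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per   # K + V
    context = 128_000
    kv_cache_gb = kv_per_token * context / 1e9

    adapter_gb = 0.4   # assumed: low-rank adapter on the attention projections

    print(f"KV cache @ {context:,} tokens: ~{kv_cache_gb:.0f} GB per user")
    print(f"Per-user adapter:             ~{adapter_gb:.1f} GB per user")

For a single local user, trading a ~40 GB cache for a small amount of per-user learned state looks attractive; for a provider, any per-user state multiplied across millions of users does not.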

j45 yesterday at 2:40 AM

It's easier to say 32 GB of RAM needs to be the norm to start getting movement on this.

krupan last Sunday at 9:35 PM

If you don't need a lot of smarts, do you even need an LLM? Aren't older machine learning techniques just as good, or like, you know, old-school algorithms?

holoduke last Sunday at 9:04 PM

We need computers with 128GB or maybe even 192GB of memory before local use makes sense. From my own experience, 32B LLMs are the absolute minimum for proper tool use and decent output quality. But for local AI you also want vision models and maybe even several LLMs, plus some memory for the system, of course. On my 36GB M3 the 24B Gemma model is nice, but the entire system gets allocated for that thing.
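
A rough sketch of where that memory goes, with illustrative assumptions for quantization and headroom:

    # Rough footprint of running a couple of local models at once (illustrative).
    def weights_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    text_llm   = weights_gb(32, 4.5)   # ~32B text model at ~4.5 bits/weight
    vision_llm = weights_gb(8, 4.5)    # assumed smaller vision-language model
    kv_and_os  = 8 + 6                 # assumed KV cache + OS/app headroom, GB

    total = text_llm + vision_llm + kv_and_os
    print(f"text {text_llm:.0f} GB + vision {vision_llm:.0f} GB + other {kv_and_os} GB -> ~{total:.0f} GB")

That already fills a 36GB machine before a second text model is loaded, which is roughly the argument for 128GB-plus.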

artursapek last Sunday at 8:11 PM

I'm someone who is trying to build a subscription-based business to cover underlying LLM costs, and I'm very hopeful I can one day just sell a permanent license to the software instead, with customers using local LLMs to power it.

cryo32 yesterday at 11:16 AM

I think no AI needs to be the norm. Even if we have enough RAM to run it locally, the dependency stack we have on hardware, training, and geopolitics is too much of a risk to take on. If something breaks, like the supply chain, or the model is found to have particular biases or exploits baked in, we're fucked.

QuadrupleA yesterday at 4:21 AM

This is just emotional rhetoric. Pretty much any app in the last 20 years has depended on a server somewhere, or a cloud provider. Like an AI provider, they can go down, they can turn off if you don't pay your bill, etc.

And local inference requires fairly beefy hardware, that is FAR from ubiquitous across today's userbases. Local models are also still far dumber than what frontier labs can serve.

Weird that this is getting such a tidal wave of upvotes.

senko yesterday at 1:48 PM

I love this line:

> Stop shipping distributed systems when you meant to ship a feature.

But not in the context the author meant.

Many people don't realize that when you have a frontend, a backend (several instances, for failover/scaling), a (separate) database, maybe some object store -- you have a distributed system.

A recent article[0] touched on that, although most HN commenters[1] latched onto the "go" part. But there's something to avoiding Rube Goldberg machines where we don't need them.

[0] https://blainsmith.com/articles/just-fucking-use-go/

[1] https://news.ycombinator.com/item?id=48062997

sgt last Sunday at 5:33 PM

I guess Google got that memo!

cubefox last Sunday at 8:19 PM

Local AI is a bit like wind parks. Everyone is in favor, except when they're in your own backyard. There was recently a huge outcry when Chrome shipped a local 4 GB AI model: https://news.ycombinator.com/item?id=48019219

I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.
