While I agree that would be the goal, we are too early for that. Speech recognition used to require a lot of servers in a datacenter and you had to send your data over; now it runs completely on-device.
We are at least 5 years away from that. And DRAM needs a substantial breakthrough in cost reduction.
I'm going through a similar exercise right now in an app I'm building: no server dependencies, moving features that have traditionally used server-side APIs onto the device, and utilizing the on-board AI features provided by Android and iOS. So far it's been a very positive experience, and the capabilities provided on these devices have been more than enough for my needs. I'm working on providing apps that don't have the ongoing operating costs of server-side infrastructure, so I can offer them as "pay once, run it forever" instead of an ongoing subscription for the user.
I think moving straight to local models is missing the required next step of open/self-hostable models which is certain to be the "AI future" end-state. Then local models become an optimization on top of that.
I just don't want us to put all this effort into on-device computation when we need to get to "SOTA-equivalent" self-hosted computation faster.
People are trying to “make the best software”, though.
I think the quixotic accelerationists of AI are more or less a vocal minority of the people who make software, and the choice of online APIs over local systems is largely a choice made for users, rather than developers' laziness.
You can do more and better with proprietary AI today than with local models. There is no getting around that. Even if local AIs get better, being on the cutting edge of LLM performance is often a very worthy investment.
Most people won’t settle for a product if it’s not the very best and incredibly convenient. That’s a high bar, and local AI often doesn’t meet those standards.
HN’s insistence on treating all users like they are open-source, privacy-first, self-hosted Linux fanatics is painfully corny.
It's almost here. Look at the new Qwen 3.6 models. Solid stuff there.
By now it runs on 8 GB of VRAM, so a Legion 5 at about $1,500 could be a good workhorse.
For me, building with open-weights models sounds like the right approach: you are able to switch providers, and you can control where the server is running.
You don't have any guarantees in terms of data, that's true; you rely on the provider. But this is similar to a database or other services that you don't have the knowledge or resources to run yourself. Hardware cost is an additional factor here.
If on the other hand your idea works out and the model fits the use case, you can always decide to move to a dedicated infrastructure later.
WRONG. This completely ignores the most important issue.
The important issue is where the data is stored. And there are far too many advantages to having your data in the cloud: you can access it from whatever device you happen to have, and it isn't lost if you lose the device. It also outsources your backups to the cloud, which is probably doing a much better job than you would (maybe not on Hacker News, but for nearly everyone else) - the cloud has earned a bad reputation for backups, but it is still much better than what most people would do themselves.
Once you accept that the data is going to be elsewhere, it doesn't matter whether the compute is elsewhere or not. The data is the important part.
What needs to become the norm is self-hosting your own data. Companies should not be outsourcing this by default - even where you outsource some of it, you need to watch your contracts and ensure the ownership is yours, not shared. Once your data is on your own cloud-accessible servers, we can start asking whether we can run our AI models in the same data center that already holds our data. I don't need my AI model to run on my phone; it can run on the server in my basement, which has a lot more power available (my phone has a better GPU, but I can't afford the battery drain of running AI on it).
I would love for local inference to be possible, but from my experience, Kimi 2.6 is the only model that would be worth it, and it's $10k (max-spec M3 Ultra, ~30s TTFT, so kind of slow) to $30k (RTX 6000 / 700GB+ DDR5) upfront, noise and power consumption aside.
The roadblock to this is that you seem to have to build it yourself. I've noted that none of the current cloud models are very good at building a replacement for themselves, and there's significant work that needs to be done to make a local LLM reliable in any way. I haven't found a single standalone package that makes setting them up easy. Sure, I can run Hermes Agent and a model, but getting the self-reflection loop in and all of the other stuff they need to actually be good? I'm still at it, trying to get anything to work reliably and factually.
Every reply here forgets or overlooks the main reason why this is not going to happen: the astronomical AI data center investments currently underway. Those places are not just for training. They are for inference too, and they are how all those investments are expected to eventually pay off. The whole AI sector of our industry depends on running models in these places.
Most people are lazy (which is (mostly) good) and don't care (which is (mostly) not good), as Gmail has proven since 2004 (according to Google AI).
Still waiting for those analog AI chips that were supposed to make it lightning fast using minimal energy...
Chrome did this, and there was a huge outcry. Even though local AI is much better for privacy.
I'm building a protocol and router runtime for hybrid local/cloud AI.
The goal is that you would assign roles to models based on tasks, capabilities and observed performance. The router would then take care of model selection in the background.
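Roughly, the selection logic looks something like this (a toy sketch only, not the actual runtime; the model names and the local/cloud split are placeholders):

```python
# Toy sketch of role-based routing, not the actual runtime.
# Model names and the local/cloud split are placeholders.
from dataclasses import dataclass

@dataclass
class Role:
    name: str
    needs_frontier: bool        # does the task need a frontier-class model?
    touches_private_data: bool  # should the data stay on the machine?

# Hypothetical role-to-model table; observed performance could update this at runtime.
MODELS = {
    "local": "qwen2.5-7b-instruct",  # small model served on-device
    "cloud": "claude-sonnet",        # frontier model behind an API
}

def pick_model(role: Role) -> tuple[str, str]:
    """Return (backend, model) for a given role."""
    if role.touches_private_data:
        return "local", MODELS["local"]  # private data never leaves the machine
    if role.needs_frontier:
        return "cloud", MODELS["cloud"]  # hard tasks go to the frontier model
    return "local", MODELS["local"]      # default to the cheap local model

print(pick_model(Role("summarize_notes", needs_frontier=False, touches_private_data=True)))
# -> ('local', 'qwen2.5-7b-instruct')
```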
It's tricky though. Probably have another two weeks before I can release the runtime.
I have a preview up at https://role-model.dev/
You can follow me on Twitter if you want updates (see profile)
- can we get suggestions from people on what the equivalent would be for Android?
- and for web / JavaScript / Svelte applications?
- suggestions for local OCR for bulk images?
I'm looking into it since I'm going to be sending personal info/thoughts and would like to keep it local. I have a 4070 running TheBloke's Mistral 7B via llama.cpp. I'm still not using LLMs daily, though, other than Google searches.
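For anyone curious, the whole setup is a few lines with the llama-cpp-python bindings (rough sketch; the GGUF path and parameters are assumptions, adjust for your own quant):

```python
# Rough sketch of the setup; the GGUF path and parameters are assumptions.
# pip install llama-cpp-python (built with CUDA for the 4070)
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # any quantized Mistral 7B GGUF
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this note: ..."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```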
We are experimenting with a local LLM and opencode at work, and the quality is not as good as Claude Code et al., but it's not far off, and local speed is actually faster. We got three of Nvidia's latest AI GPUs, which was not cheap. It's not good enough to train our own models, but we can run the biggest open models with some tweaking.
Relying on external APIs means network failure points and unavoidable latency from the round trips. There are also AI API rate limits that come into play. We might find that for critical workflows, local compute is the only reliable architecture.
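One way this could look in practice, assuming an OpenAI-style client and a local OpenAI-compatible server (llama.cpp or Ollama, say); the endpoints and model names below are placeholders:

```python
# Sketch of cloud-first with local fallback; endpoints and model names are placeholders.
from openai import OpenAI, APIConnectionError, APITimeoutError, RateLimitError

cloud = OpenAI(timeout=10.0)  # hosted API, reads OPENAI_API_KEY from the environment
local = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # llama.cpp / Ollama server

def complete(prompt: str) -> str:
    try:
        r = cloud.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": prompt}],
        )
    except (APIConnectionError, APITimeoutError, RateLimitError):
        # Network failure, timeout, or rate limit: fall back to the local model.
        r = local.chat.completions.create(
            model="local-model",
            messages=[{"role": "user", "content": prompt}],
        )
    return r.choices[0].message.content
```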
Yeah, I agree with this, especially considering that even iGPUs now get respectable numbers: my Iris Xe (80 EU) with 16 GB RAM @ 2133 MHz gets 6-8 tokens per second on the gemma-4-E4B model.
We need more tools like QMD that beautifully download and use local models under the hood
Local AI is definitely going to be the future as these models continue to advance at the rapid pace they already are.
This is why I believe OAI and Anthropic have been so aggressive at offering services beyond their pure models, like Claude Design. This is what will stay competitive and keep people subscribed.
This article makes zero sense. It's not about billing or computer systems or ease of use or anything else like that. The question is whether the scaling laws, which in the asymptote are likely the laws of physics, hold up in converting energy into smarter models. It's not really up to anyone, the labs or developers, to choose whether local or remote models will be the norm.
Not saying I disagree with the general statement, but there need to be options; not everyone has a machine capable of the kind of lifting required to properly run a local version. So what, if my machine is older I'll be locked out? Restricted? Forced to pay?
Agree with the sentiment, but: "We are building applications that stop working the moment the server crashes or a credit card expires."
This has been the case for far longer than OpenAI and Anthropic have been around, with services like AWS, Cloudflare, etc.
Not your weights, not your brain. Owning your own action and decision model is super important as these models emulate more of our decisions, thinking, and learning. Built claudectl - a local brain for coding agents: https://github.com/mercurialsolo/claudectl
I'm surprised at the presented dichotomy between JSON formatting and what the Apple SDK provides to parse output into structs.
Based on what I understand about how the former works, I would assume that the latter has the same properties and failure modes.
You know what the hard part about local AI is? Supporting it cross-platform. The OP gets off easy by staying in the Apple ecosystem, but when you need to support local AI on both iOS and Android, the approach is completely different. Even getting users to download the smallest models can be a challenge.
They will never let us have enough RAM ever again. RAM will be kept behind locked doors in the name of national security, and only trusted corporations will be allowed to run AIs and "safely" run them in the cloud and sell them to us.
Any recommendations to run a local model on a Raspberry Pi 5 16 GB?
Unless there's a breakthrough or a transition to diffusion models, it's hard to imagine them becoming an affordable commodity.
Small models are still in their infancy, and there's still much to sort out about and around them as well.
> One of the current trends in modern software is for developers to slap an API call to OpenAI or Anthropic for features within their app.
Well there’s your problem, control needs to go the other way. If you want your app to be AI-enabled, you need to make it easy for AI to control your app. Have you used OpenClaw? It’s awesome!
Overall I'm bullish on standardized local APIs that ship with the browser or platform. Far more tractable than expecting end users to stand up their own local model instances, though r/LocalLLaMA is a fantastic community to follow if you want to go that route.
A useful framing for "local vs cloud AI" splits tasks along two axes: does the task touch private data, and does it need frontier intelligence? You can use frontier models for developing the software (doesn't touch data), but open-source models running locally for ops: maintenance, debugging, and monitoring (touches data). If you need to fall back to frontier intelligence for a particularly hard-to-resolve problem, you can still rely on local models to pre-transform and filter input in a way that is privacy-preserving, or satisfies some other constraint, before it's sent off to the cloud for processing. OpenAI's privacy filter (https://openai.com/index/introducing-openai-privacy-filter/) is a good example of a model that can run locally to mask PII and secrets before any data is sent externally for processing.
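A toy version of that kind of local pre-filter (nothing like OpenAI's actual filter, just regex masking of a few obvious patterns before the cloud call) could look like:

```python
# Toy illustration of local pre-filtering before a cloud call; just regex masking
# of a couple of obvious PII/secret patterns, not a real PII detector.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
    "API_KEY": re.compile(r"sk-[A-Za-z0-9]{20,}"),
}

def mask_pii(text: str) -> str:
    """Replace matched spans with their label before sending text to a remote API."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Email john.doe@example.com or call +1 415 555 0100 about key sk-abcdefghijklmnopqrstuv"
print(mask_pii(prompt))
# -> "Email [EMAIL] or call [PHONE] about key [API_KEY]"
```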
Another framing for local vs closed frontier models, which the article mentions, is whether the task saturates model capability. For certain tasks like PDF processing, voice, or summarization, adding more intelligence isn't necessarily useful. Arguably we've already approached that point for chat interfaces with frontier open-source models. But for coding and ops through well-structured tool use inside a coding-capable harness, we're still a ways away.
Tangentially, a contrarian take here is that AI can actually enable more privacy preserving software if you’re so inclined. You can just build personalized software and it lowers the barrier to entry and the effort required to self host. SaaS complexity often comes from scaling and supporting features for all types of customers, and if you're building software for personal use, you don't need all that additional complexity. Additionally, foundational and infra software that is harder to vibecode with AI is often already open source.
It's really silly when you buy an "AI PC" with an "AI CPU" and still run all the "GenAI" stuff in the cloud.
To what extent is this strategy currently feasible for Windows or Android development? I am interested in how portable local-first AI is across platforms, but it seems promising on Apple devices.
Harnesses seem to be a big part of what makes this stuff good or not.
I tried Cline and couldn't get it working well, and part of this was that at the time it expected OpenAI's output format.
> We are building applications that stop working the moment the server crashes or a credit card expires
Isn’t this true of any application that accesses anything not running on your computer? This is just describing what it means to add an API call to your app. Nothing to do with AI (?)
> “AI everywhere” is not the goal. Useful software is the goal.
Great observation! Often the excitement of novelty makes us lose sight of the real goal
Agree with the article, but from my experiments the limitation on local LLM usefulness is their limited scope. Eventually, context-heavy data pipelines require larger models, which consumer hardware can't deal with yet. The local model for summarizing a page, like you describe, could be done via code as well; I've found using an LLM isn't always the right choice. For example, I use NER tagging in my md docs for better indexing and LLM search capabilities. This is purely code-based and not via an LLM. I tried it with an LLM and the results were a lot worse. Augmenting the tools so the LLM produces better outputs gives better results.
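For reference, the code-based tagging can be as simple as something like this (spaCy here is just one option, not necessarily what my pipeline uses):

```python
# Sketch of code-based NER tagging for markdown notes; spaCy is one option among several.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def tag_entities(md_text: str) -> dict[str, list[str]]:
    """Return entities grouped by label, e.g. {'PERSON': [...], 'ORG': [...]}."""
    doc = nlp(md_text)
    tags: dict[str, list[str]] = {}
    for ent in doc.ents:
        tags.setdefault(ent.label_, []).append(ent.text)
    return tags

note = "Met with Alice from Anthropic in Berlin to discuss the integration."
print(tag_entities(note))
# e.g. {'PERSON': ['Alice'], 'ORG': ['Anthropic'], 'GPE': ['Berlin']}
```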
I mostly agree, though I think local AI will need better UX around failure modes. Cloud models are often used not just because developers are lazy, but because they are more capable and easier to support consistently across devices.
GLM 5.1 is very impressive. I wouldn't be surprised if we get to a point where it can live in ~48 GB with reliable speed/quality.
I use the Claude API for my startup, and the billing and rate limiting hurt. But local models can't do what I need yet. Wish they could.
I’m skeptical that local AI will work well with today’s technology. Running capable models consumes too many resources on end-user devices.
It's not going to happen with LLMs unless RAM + storage get several orders of magnitude cheaper, like, yesterday.
Informatics isn't magic; you'll never be able to compress "knowledge" into a small model in a way equivalent to a 1.5 TB model.
Here I was hoping that this was some plea for us to get away from proprietary solutions that we have no control over and go back to open source, but no, not that at all.
The start of the argument is already broken. OK, slapping in an API is bad, so instead you run an API that mimics your provider's, install some Chinese LLM that will never obey any lawsuit in your country, and install 500 packages to do so, every one of them a potential security risk. How is that better?
Oh yeah, it feels independent and not lazy, sure.
I think with the turbo quant forks eventually being merged, it's becoming more feasible on mid-tier consumer hardware.
Don't quite think it's ready yet, though.
A local model should always be the first attempt on any project, and if the functionality is acceptable, it should stay with local models. Token burn is a serious problem and will ultimately lead developers to ask one question: "Do I really need Opus xyz?" For most requirements of standard applications the answer is no. So use open-source LLMs integrated into practical use cases to create real value-add, not "hey look, I have AI in my app, sign up please." Open-source models are competing well and are the way to go for the majority of projects; mindsets do have to change, and I see them changing rapidly. You don't have to host your open-source LLM locally - host it with a third party; it is cost-effective and the token burn is not a barrier.
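Pointing an app at a third-party host of an open model is usually just a base-URL swap with an OpenAI-compatible client (sketch; the provider URL and model name below are placeholders, not a recommendation):

```python
# Sketch: the same OpenAI-style client, pointed at a third-party host of an
# open-weights model. base_url and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-host.com/v1",  # any OpenAI-compatible provider
    api_key="YOUR_PROVIDER_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # whichever open model the provider serves
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
)
print(resp.choices[0].message.content)
```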
Agreed, but the way RAM prices are going, I don't think we'll be able to afford hardware that can run any useful model.
Local AI will catch up. Unless we can't get our hands on hardware anymore, which is a legitimate concern I have.
Running LLMs locally is one way to appreciate the level of hardware and infrastructure that frontier AI companies are running. Makes me wonder about future strategies.
As one commenter mentioned, 2x Mac Studio M3 Max with 512GB can run frontier models, and it costs $30k (with RDMA). Apply an efficiency ratio for being in a datacenter, and you understand why OpenAI and the like spend north of $10k of CAPEX _per customer_.
Add to that the electricity costs and you've got a very shaky business model. I for one would like to thank the VC for subsidizing my tokens.
With that said, the VCs are not crazy and probably factored in an annual decrease in the cost of computing power. But how do you make sure that we won't run local LLMs once the hardware becomes affordable - if ever?
The answer has always been the same in our industry: vendor lock-in. They are getting the users now at a loss, hoping for future captive revenues.
So, be careful when your code maintenance requires the full context that yielded that code, and that context lives in [Claude Code|Codex|Cursor].