Hacker News

Local AI needs to be the norm

1809 points by cylo · last Sunday at 5:19 PM · 720 comments

Comments

pronik · last Sunday at 9:08 PM

They will be, and that moment is not that far off. We've got the progression in place already: at first, only large data centers could run performant LLMs; we are now firmly in "a bunch of servers with a couple of H100s each" territory, slowly moving toward "128 GB of VRAM on a MacBook Pro or a Strix Halo". Within the next year, the pattern of "expensive remote LLM for planning, local slow-but-faster-than-human LLM for execution" will become the norm for companies, slowly giving way to "using a local LLM for everything is good enough". And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed. The question will be how much of the current compute-capacity craze local hosting will give the kiss of death to, and what that means for the market.

Akuehne · yesterday at 12:42 PM

I feel like lots of people here are just commenting on the headline.

This isn't about the local models you're running on your old gaming rig, or the Tesla P40 rig you built for local LLMs.

This is about code leveraging the local resources where the code is running for its AI needs. Rather than making an API call to an external AI service, the code leverages the AI capabilities built into the hardware it runs on. With modern Apple, Intel, and AMD silicon all shipping dedicated AI acceleration, this is where, IMO, the focus should be heading.

How many FLOPS or whatever can your phone do? I bet it's enough to paint the walls of your living room, or draw a pretty good pelican on a bike.

0xbadcafebee · yesterday at 1:31 AM

Here's some things you can do right now with local models on a consumer device:

- text-to-speech
- speech-to-text
- dictionary
- encyclopedia
- help troubleshooting errors
- generate common recipes and nutritional facts
- proofread emails, blog posts
- search a large trove of documents, find information, summarize it (RAG)
- manipulate your terminal/browser/etc
- analyze a picture or video
- generate a picture or video
- generate PDFs, documents, etc (code exec)
- simple programming
- financial analysis/planning
- math and science analysis
- find simple first aid/medical information
- "rubber ducking" but the duck talks back

A quarter of those don't need more than a gig of RAM; the rest benefit from more. Technically you don't even need a GPU, it just makes things faster. I do half that stuff on my laptop with local models every day.
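Most of that list is a few lines of glue around any local runner. A minimal sketch, assuming Ollama is running locally, the `ollama` Python client is installed, and a small model has already been pulled (the model name below is just an example):

```python
# Minimal sketch: local summarization/proofreading with a small model.
# Assumes `ollama serve` is running, `pip install ollama`, and a small model
# (the name here is illustrative) was pulled beforehand with `ollama pull`.
import ollama

def summarize(text: str, model: str = "llama3.2:3b") -> str:
    resp = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the user's text in three bullet points."},
            {"role": "user", "content": text},
        ],
    )
    return resp["message"]["content"]

if __name__ == "__main__":
    print(summarize(open("notes.txt").read()))
```

Swap the system prompt and the same shape covers proofreading, classification, or extraction.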

That said, it really doesn't need to be local. I like the idea that I can do all that stuff offline if I'm traveling, but I usually have cell service, and the total token cost is pretty cheap (like $2/month for all my non-coding AI use).

chakintosh · yesterday at 3:04 PM

I'm literally working on an iOS app right now that needs to infer some input fields from free text typed by the user. To handle typos and unstructured text (pricing, dates, etc.), I was pondering a cloud LLM, a basic local parser, or even a local on-device LLM (the ANE for 15+ devices and a different on-device LLM for the older models).

For the different on-device LLM, I literally went to HuggingFace and filtered by the smallest available models that can do the job, and Granite-4.0-h-1b works just fine: it corrects typos and infers dates, currencies, and all the fields I need.

And it got me thinking how my first reflex was to rely on a cloud LLM, which is waaay overkill for my need. Granted, an on-device LLM will need to be loaded on the devices on install or downloaded after the fact (which adds latency when the user needs it for the first time), but still, it's a better tradeoff than a cloud LLM.
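For a sense of how small that task is, here is a rough sketch of the extraction in Python (the real app would run this on-device in Swift/Core ML; the model tag is illustrative, not necessarily the exact Granite checkpoint):

```python
# Sketch of structured field extraction with a small local model.
# Desktop Python stand-in for what the app does on-device; model tag is illustrative.
import json
import ollama

PROMPT = """Extract these fields from the text and reply with JSON only:
{"amount": number or null, "currency": string or null, "date": "YYYY-MM-DD" or null}
Fix obvious typos. Text: """

def extract_fields(free_text: str, model: str = "granite4:micro") -> dict:
    resp = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT + free_text}],
        format="json",  # ask the runtime to constrain output to valid JSON
    )
    return json.loads(resp["message"]["content"])

print(extract_fields("paid 49,99 eur for hosting on 3rd of janury"))
```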

I decided on a basic parser, and so far it seems to work fine. Granted, it struggles with some words, but I just need to fine-tune it to have as much coverage as possible in terms of typos without triggering false positives.

A lot of developers have that reflex too, go along with it, and then just pass the API costs on to the customer. I could have gone that route too, but it turned out I don't even need an LLM for my use case.

gkcnlr · yesterday at 3:45 AM

It seems like everybody is focused on "LLM"s, a.k.a. Large Language Models. One interesting addition to that is fine-tuned, small-parameter, distilled, context-dependent small language models that:

1. Do a particular task with great capability (due to their constrained, limited scope)
2. Do it in such a way that it integrates gracefully into your workflow without ever requiring you to know you are using an LM.

There is a difference between outsourcing your workflow to AI and actually utilizing it.

Check this: https://www.distillabs.ai/blog/we-benchmarked-12-small-langu...

adamtaylor_13 · yesterday at 2:15 AM

Cool, well let me know when Opus 4.5 level performance is available locally, at speeds that serve everyday use, and 100% I'm right there with you.

Until then, I'm going to keep sending my JSON to the server farm in Virginia because it's the only place that can serve me a model that actually works for my uses.

TheJCDenton · last Sunday at 8:15 PM

For the mainstream audience, the sentiment around local AI today is the same as they had around open source a few decades ago. For a few products, some paid solutions were so much more advanced that open source was very often completely overlooked. "Why bother?" and the like. Then we had captive SaaS and other platforms, and now that attitude is obviously wrong for most of us.

The dependency we have on Anthropic and OpenAI for coding, for instance, is insane. Most accept it because either they don't care, or they just hope the Chinese labs will never stop releasing open weights. The business model of open weights is very new, includes some power play between countries and labs, and moves an absurd amount of money without any concrete oversight from most people.

It's a very dangerous gamble. Today incredible value is available for nearly everyone. But it may stop without any warning, for reasons outside our control.

Guillaume86 · last Sunday at 9:46 PM

I think we should separate the private AI discussion from the local AI discussion. The pragmatic choice for running big LLMs is one or several big servers online, but that doesn't mean private companies should be the only ones to run them.

A self-hosted inference solution that offers good tenant isolation guarantees (ideally zero trust) and is easy enough to deploy and maintain (think Plex for AI) would be my choice for privacy. Now, to be honest, I have done zero research about this and have zero idea how feasible that is; maybe it already exists and there are some Discord servers I should join?

Edit: I don't need to mention it here, but what's incredible is that open models are in the ballpark of the best commercial models, so supposedly the hardest part by far is already solved.

rmunn · yesterday at 4:09 AM

For image generation, this has already happened. To what degree, I can't tell, as I don't do image generation much, so I don't have numbers on Midjourney subscriptions or any other image-AI-as-a-service sites. But civitai.com has become a place where people share their models, based on Stable Diffusion or other similar bases, with various fine-tunings to achieve desired results. You name it, you can find a model for it at Civitai, and people are doing some very creative things with them. (And also a lot of the obvious things, but it's the Internet, what did you expect?)

I haven't seen a text-based model sharing site spring up yet (perhaps one already has and I don't know about it). Civitai, being focused on image generation, has the obvious advantage that it's easy to show off impressive results from a model on the front page of the website, while judging what someone's home-grown fine-tuned LLM will produce is a lot harder. But at some point I expect a Civitai-equivalent site for text models, especially code-focused ones, to become popular. That will seriously undercut Anthropic, OpenAI, et al, and will probably force them to find a price equilibrium.

Because once you're competing with "I spend $2,500 up front on a powerful video card, download an open-source model for free, and then I get pretty much everything I need for free" (additional power cost of running that video card isn't nothing, but probably not noticeable in your power bill compared to what you're already using)... then suddenly $200/month means your customers are thinking "after one year I would have been better off with the homegrown solution". The only way they'll continue to pay $200/month is if Claude/GPT/Gemini/whoever is truly head-and-shoulders above the "pay upfront once for hardware then use it for free afterwards" models available. And that's going to be doable, perhaps, but tough.

tzm · yesterday at 6:00 AM

People want local AI, but only if UX is good. Tooling/harness quality may matter as much as model quality.

I think the future will probably be a hybrid of:

1. local AI for simple, private, everyday tasks

2. online AI for very hard or long tasks

supermdguy · yesterday at 2:12 AM

Interesting to see this after the recent post about Chrome's on-device model using up 4 GB of storage, which frustrated a lot of people [1].

I agree local models are great, and it’s cool that Apple has models built in now. But I feel like it basically has to be an OS level feature or users are going to get upset. I’d certainly rather have a small utility call out to OpenAI than download its own model.

[1]: https://news.ycombinator.com/item?id=48019219

wrxd · last Sunday at 9:43 PM

The example in the post confirms my theory that for local models to succeed they need to be "good enough"; they don't need to be big enough to compete with frontier models.

They need to be able to do a small task well and they need to be able to run reasonably on consumer-class devices. Even better if they can run on mobile phones.

In my experiments with local LLMs I noticed that while increasing the size of the model is nice, the real thing that turns a barely usable model into something useful is the ability to use tools. Giving my models the ability to search the web and fetch web pages did way more to solve hallucinations than getting a bigger model, and it sidesteps the training cutoff. Sure, the bigger model is probably better at using tools, but I often find the smaller models to be good enough.
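The loop doesn't have to be fancy, either. A rough sketch of giving a small local model a single page-fetch "tool" (the model tag is a placeholder, and real setups would use the runtime's proper tool-calling API rather than a magic keyword):

```python
# Rough sketch: a small local model with one tool, fetching a web page.
# Model tag is a placeholder; production agents should use native tool calling.
import re
import requests
import ollama

SYSTEM = ("Answer the question. If you need a web page, reply with exactly "
          "FETCH <url> on its own line and wait for the page text.")

def ask(question: str, model: str = "qwen2.5:7b") -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    reply = ""
    for _ in range(4):  # allow a few tool round-trips at most
        reply = ollama.chat(model=model, messages=messages)["message"]["content"]
        match = re.search(r"FETCH\s+(\S+)", reply)
        if not match:
            return reply  # final answer, no tool call requested
        page = requests.get(match.group(1), timeout=10).text[:8000]
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": "Page content:\n" + page}]
    return reply
```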

wolvoleo · yesterday at 3:11 PM

> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.

> And for those tasks, local models can be truly excellent.

100% true, and I use them for this. But the open-source models seem to be drying up, unfortunately. There never was much incentive for the big players to train a model and give it away for free; it was mostly virtue signalling and advertising for their know-how. The AI "race" seems to have entered a new phase that's more about clamping down on costs and making money, and this doesn't fit in well.

I hope good local models will still appear, but the days when there was a new groundbreaking model to download every couple of weeks are over :'(

sinansaka · yesterday at 2:08 PM

I'm betting my startup on it. The subsidised model subscriptions will start to dry up, and providers will lean heavier into locking down how they want their models to be used (Anthropic has been paving the way already). The only way forward is open-weight models. If you are working on any LLM-powered product, be careful betting on utilising user subscriptions.

robot-wrangler · last Sunday at 10:05 PM

Entrenched interests are going to do everything to stop local, but there's at least a few technical reasons to believe small and specialized models could be the norm eventually. If that does happen, local will follow.

TFA is focused on whether big models are necessary for what users want. There's some evidence they may never actually be reliable enough unless a) mechanistic interpretability matures far enough or b) our multi-agent systems all become multi-model.

For (a), advancement in MI might fix problems with big models, but would also mean we can maybe get unified representations, and just slice and dice the useful stuff out of huge models, getting only what we need without the junk. Ability to isolate problems won't really come without bringing the ability to isolate functional subsystems. Only want logic? Only vision? Just cut it out of the big monster and enjoy reduced costs and surface area for problems.

For (b), just look at stuff like the evil vector, or the category of hallucinations specific to tool use. Without a complete solution for helpful/honest/harmless alignment, it seems likely that creativity and rigor (and many other things) are fundamentally at odds. If you start to need many models for everything anyway, why do we need the huge expensive do-everything ones? So specialization also becomes a pressure to shrink everything towards minimal reliable experts.

revolvingthrow · last Sunday at 8:06 PM

A local Answer Machine is the dream, especially when the internet is decaying and generally on its last legs, but the hardware requirements seem like a huge mountain to climb. Things are progressing tremendously - DeepSeek v4 flash is very good for what it is - but even that goes beyond any reasonable local setup, which imo is 128 GB RAM + 16 GB VRAM. Four RAM slots on a consumer board crater RAM speed, 256 GB Macs are too expensive, and even then the inference is ungodly slow.

On the other hand… the v4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we'll get similar performance in a ~120B model in a year, which is viable (if expensive) for everyman hardware. Possibly you'll be able to run its equivalent on a ~$1200 laptop by 2028, which for me-in-2020 would sound straight out of a scifi movie. A good harness that lets the model fetch data from other sources, like a local Wikipedia copy from Kiwix, could do a lot for factual knowledge too; there's only so much you can encode in the model itself, but even a cheapish (pre-current-prices) 2TB drive can hold an immense amount of LLM-accessible data.

Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.

scriptsmith · last Sunday at 9:22 PM

I've got some demos of what the new Prompt API in Chrome that uses a local model can do: https://adsm.dev/posts/prompt-api/#what-could-you-build-with...

As OP says, it shines in constrained environments where the model is transforming user-owned data. Definitely less useful for anything more open-ended.

leoc · yesterday at 3:26 PM

(I am not an expert on anything.) One happy circumstance here is that while the RAM cartel is chasing Big AI's money today, in the medium term its self-interest probably makes it a supporter of local AI. A new, compelling reason to have 128GiB, 256GiB or more of VRAM on all your devices? You can be sure that the dollar signs are glowing in their eyes already. The less efficient use of VRAM by personal devices (any given device's VRAM will be mostly idle much of the time) tends to make it more attractive, all else being equal (though of course it isn't), compared to the centralised systems run by engineers and accountants striving all day to maximise ROI; and in any case, since the short-run supply constraints on RAM go away in the longer term, the RAM manufacturers will be able to supply both. My guess is that you can probably also explain Apple's AI strategy (sit tight and wait for Moore's Law to make local AI more viable) and maybe even Nvidia's (lay the groundwork for a gradual switch from selling shovels to the army to selling shovels at Home Depot over time, at least as a Plan B) in similar terms.

duchenne · yesterday at 3:38 AM

Cloud models can use batch processing, which is significantly more efficient. A local model has basically a batch of one, which takes about as much time to process as a batch of 100, because the GPU is memory bound and spends most of its time loading the model from VRAM into the GPU cache while the GPU cores sit idle. With a batch of 100, the model-loading time and compute time are roughly similar. So local models start out roughly 100x less efficient. Secondly, local models are idle most of the time waiting for the user to write a prompt, so the efficiency gap is probably more around 1000x.
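Back-of-envelope version of the batching argument; every number below is an assumption for illustration, not a measurement:

```python
# Illustrative arithmetic only: why batch-1 decoding is memory-bandwidth bound.
weights_gb = 16            # e.g. an ~8B-parameter model in fp16
bandwidth_gb_s = 1000      # rough HBM-class memory bandwidth
flops_per_token = 2 * 8e9  # ~2 * params FLOPs per generated token, per sequence
peak_flops = 300e12        # rough GPU compute throughput

# Batch of 1: every generated token requires streaming all weights once.
tok_s_batch1 = bandwidth_gb_s / weights_gb            # ~62 tok/s
# Batch of 100: the same weight streaming is shared across 100 sequences,
# and compute still isn't the bottleneck (~19k tok/s compute ceiling).
compute_ceiling = peak_flops / flops_per_token
tok_s_batch100 = min(100 * tok_s_batch1, compute_ceiling)  # ~6250 tok/s total

print(round(tok_s_batch1), round(tok_s_batch100), round(compute_ceiling))
```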

timeattack · last Sunday at 8:05 PM

My problem with LLMs (apart from philosophical aspects and economic impact) is that it would be unlikely for any of us to be able to train something functional locally (toy-like LLMs -- sure, but something really useful -- no). Apart from requiring immense computing power, it also requires a dataset which is, for the most part, obtained illegally.

vb-8448 · last Sunday at 8:27 PM

> Use cloud models only when they’re genuinely necessary.

The problem is that it's much easier to use the SOTA models (especially if they are subsidized) than to spend time fiddling with the knobs on a local one.

I just realized this with coding agents: yeah, you probably shouldn't always use the latest version at xhigh, but you end up doing it because you get the job done in less time, with less "effort", and at basically the same price.

I guess we'll see a real effort for local AI only when major vendors start billing based on actual token usage.

jillesvangurp · yesterday at 9:32 AM

I get the sentiment for self hosting. But there are a few counter arguments:

- Self hosting is expensive. It involves expensive machines with GPUs that cost hundreds per month if you use cloud based ones. You might need multiple of those. And you need people to mind those machines and they are even more expensive per month.

- If you run stuff on your laptop, it consumes a lot of resources and energy. I have Qwen running on my laptop. Even minimal usage turns my laptop into a radiator. Nice as a demo, but I can't have it this hot all the time. It would run out of battery, and it's probably not great for the longevity of components in the laptop.

- Models are evolving quickly and the self-hosted smaller ones aren't as good when it comes to things like tool usage, reasoning, etc. Being able to switch to the latest model is valuable.

- It's easier to get your use case working with one of the top models than with one of the smaller self hosted ones.

- If you get the wrong hardware, it might not be able to run the latest models very soon.

- Self hosting models is mostly a cost optimization. It only becomes relevant if you hit a certain scale.

- You have alternatives in the form of hosted models via a wide range of service providers. Some of those are EU based and offer all the things you'd be looking for if you are offering your services there. Including legal requirements.

- Reinventing what these companies do in house is technically challenging and possibly more expensive than self hosting models because now you need a lot of engineering capacity dedicated to that. And legal. And all the rest.

If, like most companies/people, you are at the experimenting stage, the cheapest and fastest option is just getting an API key from an API provider of your choice. You can take it from there if your experiment actually works. And then it's mostly about optimizing cost. If your API usage goes into the thousands per month or worse, it becomes a cost/quality trade-off.

manyatoms · yesterday at 3:32 AM

It just depends how quickly models become "good enough" that we don't care about SOTA.

almogodel · yesterday at 8:20 AM

Remember nodes and graphs? A Comfy-style user interface allows pretty incredible wiring among models; local AI is like Eurorack. The current graph skews heavily towards a pair of small dense models collaborating with the large heavyweights selectively. It's Qwen 3.6 27B with Gemma 4 31B, both unquantized, bf16/fp16, with Phi 14B, Nemotron Cascade 2, and then those large heavyweights: R1 and subsequent DeepSeek models including Speciale, GPT-OSS 120B, GLM, MiniMax, Kimi, Command R, the Mistrals, everybody, all in one graph, all them LLM nodes patched and interconnected. Slow, resource-intensive, better than non-local AI. I used Matteo's graphllm for inspiration, and ComfyUI (and ST), and used the models to roll a new imgui node/graph model compositor. Now what?!

ninjahawk1 · yesterday at 6:33 AM

In my opinion, this is similar to the early internet and computers. Few households or individuals had access to state-of-the-art computers; it was primarily researchers or more well-off individuals. Most random people didn't really know what it was and certainly didn't use one.

Now today, AI is very expensive and not readily accessible to most people without paying a good amount.

The early internet turned into: now you can just get a free phone from phone companies so long as you take their extras. Then you get a ton of subscriptions and add-ons, but you don't have to spend money; you could just use YouTube with ads, etc.

Local AI would similarly shift this dynamic to paying for access to plug-ins and tools for your local AI to use, like how the subscription model works right now.

With local model advancements, such as specifically Qwen 3.6 35B A3B, this future is becoming more likely by the year IMO.

mitchsayre · yesterday at 11:52 AM

I now fully believe that the models will soon be compact enough to work even on older mobile devices. I work on lightweight text-to-speech models. After training on distilled datasets, the models sound basically the same as any closed-source speech API model; they just need a ton of data to train on. Other researchers are seeing similar gains with other types of models, and it's only a matter of time before one drops that's commercially viable. Once this happens, the innovative apps and games will begin shipping AI as a feature that drives the user experience forward, rather than the thing you price the entire product around.

hyfgfh · last Sunday at 10:17 PM

Local LLMs are the only viable thing, and probably the only thing that will remain once the hype dies down.

A smaller, cheaper local model can deliver most of the value for coding, while we still use some services for code review and security compliance.

Once the VC money runs out and they start to charge the real price, the C-level will have to impose budgets or limits. The current pissing contest over who can spend the most tokens is both ridiculous and shortsighted.

jjordan · last Sunday at 7:59 PM

It feels like we're one technological breakthrough away from all of these data centers going up being deemed irrelevant.

mgrund · yesterday at 7:46 AM

I really really want to like local AI, but I highly doubt it will see wide adoption for a long time.

The additional up-front cost for hardware designed to run an LLM in addition to normal workload is unlikely to be accepted by most consumers.

The scale will be very constrained (like Apple's on-device models, which are small, heavily quantized, and have a small 4K-token context window). It's also terrible for battery life.

AI as it is implemented today is simply just computationally expensive and unless you put in dedicated hardware (like the ANE) for only this purpose - a large cost driver - I don’t really see it getting large scale adoption.

Companies will probably need a server-backed solution as fallback if they want reasonable user experience, so why even invest in diverse hardware support.

PeterStuer · today at 5:24 AM

I use a 4090 and 96 GB of RAM to run local models slowly (atm Qwen-code-next at 7 tps) with their full context window. I keep this up just for testing and practicing fallback should I lose access to Claude and GPT.

holtkam2 · last Sunday at 8:33 PM

I wish I could upvote this twice. We (devs) really REALLY need to consider on-device compute before going to the cloud for LLM inference.

mattlondon · last Sunday at 8:36 PM

Yet there is another post a few rows down where people are losing their shit that Chrome has a local LLM model that uses a couple of GB of space for local-inference.

Damned if they do, damned if they don't.

Animats · last Sunday at 9:01 PM

Question: for software development, how much of an AI do you need for local development? Can it be run locally? Can someone train something that knows a lot about software but lacks comprehensive coverage of history, politics, and popular culture?

latentframe · today at 3:42 AM

A lot of AI probably doesn't need to be a permanent cloud service. As local hardware improves, part of the industry may change from renting intelligence to on-device computing.

deweywsu · yesterday at 3:30 PM

How is having local AI going to produce a result that's any better than using OpenAI or Anthropic? Isn't what we really need programmers who rely on themselves more than AI so they avoid technical debt accumulation?

diwank · yesterday at 3:11 AM

In order for us to get there, I think we need a standardized API at the OS layer for local models, so that the OS can optimize, batch, and safely allocate resources. Something like an analog of Chrome's local-model "Prompt" API, but provided and managed by the OS itself. The user can choose which model they want to primarily use and so on, but all of the heavy lifting and continuous batching is done automatically by the OS.
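To make the shape of that concrete, here is a purely hypothetical sketch; none of these names exist as an OS API today, and Ollama just stands in for "whatever runtime the OS ships":

```python
# Purely hypothetical sketch of an OS-level local-model API; all names invented.
# The idea: apps request a capability, the OS owns model choice, batching,
# and resource limits. Ollama is only a stand-in runtime here.
from dataclasses import dataclass
import ollama

@dataclass
class CompletionRequest:
    prompt: str
    capability: str = "summarize"   # apps request a task, not a specific model
    max_tokens: int = 256

class OSModelSession:
    def __init__(self, app_id: str):
        self.app_id = app_id        # lets the OS meter and sandbox per app

    def complete(self, req: CompletionRequest) -> str:
        # A real OS service would queue this with other apps' requests
        # (continuous batching) and run the user's configured default model.
        resp = ollama.chat(model="llama3.2:3b",  # user-chosen default, illustrative
                           messages=[{"role": "user", "content": req.prompt}],
                           options={"num_predict": req.max_tokens})
        return resp["message"]["content"]

session = OSModelSession(app_id="com.example.notes")
print(session.complete(CompletionRequest(prompt="Summarize: local AI should be the norm.")))
```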

QuadrupleA · yesterday at 4:36 AM

Not sure how excited I feel about visiting your website and having it auto-download an 8 GB model with GPT-3.5-level hallucinations, and then probably crash because I only have 6 GB of VRAM. My dad won't be able to use it, nor will anyone else without a bleeding-edge device. On a powerful enough "neural engine" device the battery will be drained quickly, while the heatsink burns a hole in my lap.

gregjw · yesterday at 8:24 AM

Is there a place to learn more about local AI specifically, and maybe even more specifically about models for bespoke purposes or curating them yourself for more specific uses? Feels like there's a lot of fat you can trim off because you don't need generic use, but I don't understand where to even begin there.

nate · yesterday at 2:04 AM

I've been fooling with the Apple Foundation model for AlliHat, so you can chat with it from a Safari sidebar instead of just Claude. It's passable for some basic things like summarizing a page. But it really reminds me of Claude from like 3 years ago. I was trying to get it to generate synonyms for me and it would only generate about 10, with some duplicates. And when I asked for more, it said it would be a waste of resources to generate more. It has some kind of "act responsible" thing that Claude seemed to have. I also asked it to help me come up with synonyms for the game Pimantle, and it decided Pimantle was related to the adult industry, and no matter how many times I said "it's just a game" or "I think you've misunderstood", it was stuck on not helping me with anything related to adult websites. And recommended I should play Wordle instead.

All of this being said, it seems Claude gave up this "constitution" it used to train on? I remember trying to get it to help me code some video editing tools, and it was convinced I was pirating videos and so wouldn't help me anymore in that session.

tomelders · yesterday at 8:21 AM

I do think local models are the future, but there's still the question of cost to be answered. Even if there's some slew of efficiency improvements that mean an LLM can run locally on consumer-level hardware on an affordable budget (and that's a big "if"), there's still the cost of training the models to consider.

Assuming we end up in a future where people pay to run multiple smaller models on their machines for specific tasks (e.g. A summariser model, a python coding model, or however fine grained/macro you want to go), the people training those models will need to turn a profit.

So how much will that cost? And how often will consumers have to pay? Models have a very short shelf life. Say you have a dedicated Python coding model - that needs re-training every time there's a significant update to the language itself, any popular packages, or related technologies (e.g. servers, cloud infra, etc.). So how often will users need to "upgrade" to the latest version? It's going to be "frequently".

And it still needs the language stuff on top of that. Users aren't going to interact with a python coding model by writing python. They're going to use natural language. So the model needs all that stuff. And they're going to give it problems to solve. What if you asked the model "Write me a Bezier curve function". It needs to know about bezier curves, which have nothing to do with Python. So where do these LLM providers draw the line on what makes it into the training data and what doesn't?

And if an LLM doesn't know what a Bezier curve is, that's not going to stop it from just hallucinating an answer. If a significant proportion of prompts resulted in a response that said "Sorry, I don't know what you're talking about", then people would just stop using it. The utility of these things would be quickly overshadowed by the frustrations.

The way these frontier models have been introduced and promoted has set unrealistic expectations, and there's no putting the genie back in the bottle.

hackermanai · yesterday at 4:32 AM

> “But Local Models Aren’t As Smart”

This is what makes me continuously doubt and rewrite the local-first approach to inline chat in my editor. Next edit / code complete makes more sense due to the latency advantage. But chat is hard.

It's fast and feels good to run locally, but the output quality is just not ChatGPT et al.

acidhousemcnab · yesterday at 10:36 AM

We need better GUI and OS integrations with sandboxed local LLMs before this is thrust on everyone and rolled out as the default in commercial OSes. Here in Berlin, I was functionally surrounded and hounded out of a local meetup due to a confrontation over the naive pushing of OS-level, network-access agentic AI, presented in the mode of mystical powers and artistic possibilities, which, after recent experiences, comes off as string-pulling to produce a threat or danger that then must be observed and kept tabs on, per Goodhart's Law.

hackyhacky · last Sunday at 9:34 PM

I would like a standardized API for local AI to exist outside of the Apple ecosystem. The Prompt API in Chrome is halfway there.

* What is the answer to local AI for native apps on Windows?

* What is the answer to local AI for Linux?

This is a big opportunity for Linux, given the high quality of open-weight models. I hope some answer emerges before designs fracture and we get a dozen mutually incompatible answers.

everlier · last Sunday at 10:50 PM

There was never a better time to run LLMs locally. It's just a few commands from zero to a fully working LLM homelab.

```
harbor pull unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL

# Open WebUI -> llama.cpp + SearXNG for Web RAG + OpenTerminal as sandbox
harbor up searxng webui llamacpp openterminal
```

That's it, it's already better than Claude's or ChatGPT's app.

Tepix · yesterday at 7:33 AM

I'm pretty sure that AI assistants will become widespread.

I consider it to be very careless to entrust your emails, your chats, your calendar, your notes, your calls, your pictures, your contacts, your location history, your waking hours, your files, your TODO list, i.e. stuff including your health data to the for-profit AI companies. The temptation to earn money with your data is just too great, plus the risk of the data being stolen and sold illegally.

Local AI should be the default. For everyone who can't do local AI, we need confidential compute. Yes, it has been hacked before, but it makes attacks a lot harder.

selectedambient · yesterday at 11:37 PM

Agree. We ought to be measuring the minimum viability of lower-parameter local models for specific tasks. You don't need Opus 4.7 or Sonnet 4.6 to accomplish some of these basic yet tedious tasks, e.g. the news aggregator you demonstrated. Think about things like: how many parameters does it take to manipulate a PDF in every way possible with accurate results? Likely, a reason there isn't a coordinated push toward people running local models is the fact that your data then couldn't be mined, manipulated, and abused; that's obviously aside from the pure capability of some of the frontier models (which, truthfully, aren't all even very good). While I think we may see more things like Apple's models, as you mentioned, being run locally, I think we all know that at the end of the day they're phoning home in some way (which, if that is fine for you, fine). Again though, and you touch on this in the article, highly specific tasks that have a certain amount of redundancy built in are very suitable for these local models right now, without relying on enormous weights and token usage.

I have been working on a VERY SMALL local-first AI lab myself. Nothing crazy: a text editor, a claw, and some lightweight models I started playing with. Absolutely looking for contributions as well.

continueops_com · yesterday at 8:39 AM

Opus's 1M context window and lightning-fast response time are hard to compete with. Even if you run a local A100, local models are just not as good at tool calling, long-running tasks, and avoiding hallucinations.

deivid · last Sunday at 11:41 PM

Sounds great, but if you didn't cave to Apple/Google (e.g. Graphene, Lineage), models are not built in. Every app needs to ship its own models, and they are not tiny.

Is there a solution for this? I'm currently just making users download onnx models if they want a feature, but it's not smooth UX

h05sz487b · yesterday at 6:24 AM

I really want this to be true. For me, getting all models to run to the best of my hardware's ability, and getting the CLI tool to also make the best use of the model, is still a headache. I've had coding models not being able to do a search-and-replace depending on the tool through which they were called, visible <thinking> elements in my message flow, and agents doing a task, failing at the linter, then reverting everything again so the linter is happy and presenting the result as a "good compromise".

Right now it feels like we have all the pieces but nobody is integrating them into an amazing experience.

harrouet · yesterday at 11:29 AM

Running LLMs locally is one way to realize the level of hardware and infrastructure that frontier AI companies are running. Makes me wonder about future strategies.

As one commenter mentioned, 2x Mac Studio M3 Max with 512GB can run frontier models and it costs $30k (with RDMA). Apply an efficiency ratio for being in a datacenter, and you understand why OpenAI and the likes spend north of $10k _per customer_ of CAPEX.

Add to that the electricity costs and you've got a very shaky business model. I for one would like to thank the VC for subsidizing my tokens.

With that said, the VCs are not crazy and probably factored in an annual decrease in the cost of computing power. But how do you make sure that we won't run local LLMs when the hardware becomes affordable -- if ever?

The answer has always been the same in our industry: vendor lock-in. They are getting the users now at a loss, hoping for future captive revenues.

So, be careful when your code maintenance requires the full context that yielded that code, and that this context is in [Claude Code|Codex|Cursor].
