Hacker News

reisse · last Sunday at 11:05 PM

> They will be, and that moment is not that far off.

It's here, right now. I'm running quantized Qwen and Gemma on a decent, but three years old gaming rig (think RTX 3080 12GB and 32 GB RAM). Yes, it's slow, and it has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spending. It can answer simple questions, analyze code and even write code when little context is required. I could probably get a half-decent autocomplete out of it, if I bothered with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think.
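For the photo categorization, a minimal sketch of what I mean by a harness, assuming an Ollama server with some vision-capable quantized model pulled locally (the model tag, folder and category list are just placeholders):

    # Ask a local vision model (served by Ollama) to tag each trip photo with
    # one category. Model name and categories are illustrative placeholders.
    import base64, json, pathlib, urllib.request

    OLLAMA_URL = "http://localhost:11434/api/generate"
    MODEL = "gemma3:12b"   # assumption: any local vision-capable model works here
    CATEGORIES = "food, landscape, people, documents, other"

    def categorize(photo: pathlib.Path) -> str:
        payload = {
            "model": MODEL,
            "prompt": f"Answer with exactly one word from: {CATEGORIES}.",
            "images": [base64.b64encode(photo.read_bytes()).decode()],
            "stream": False,
        }
        req = urllib.request.Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"].strip().lower()

    for photo in sorted(pathlib.Path("trip_photos").glob("*.jpg")):
        print(photo.name, "->", categorize(photo))

Nothing clever here; the model does the work, the harness is just a loop and a constrained prompt.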

> And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed.

Currently, it works exactly the other way. The cloud versions are orders of magnitude cheaper than self-hosting, because sharing can utilize servers much more efficiently. A company can spend half a million bucks on a rig running GLM 5.1, and get data security, flexibility and lack of censorship, but oh, it's so expensive compared to Anthropic per-seat plans.


Replies

pbgcp2026 · yesterday at 8:45 AM

I'm sorry to spoil it for you, but a Perl script was able to do all of that like... 10 years ago? Out-of-the-box Shotwell manages photos quite well without any intelligence. The problem, as people mentioned above, is SOTA models' cognitive and tooling abilities. Also, have you noticed how top-end Mac Studios got downgraded recently? They don't want you to have access to frontier models. And you will not have it. See Mythos as Exhibit A.

digitaltrees · yesterday at 12:58 AM

I built my own IDE and run my own model specifically to have private agentic coding. I can still access model APIs, but I can be purely local if I want to. It's amazing.

DrewADesign · yesterday at 3:07 AM

Multiple gazillion-dollar companies each seem to be spending to ensure that they alone pretty much dominate all knowledge work, with customers eating up their tokens like Cookie Monster. I wonder if any of them could survive as LLM providers if they not only failed to do that, but the entire industry ended up selling what the current Cookie Monster would call a "sometimes snack," for very special occasions?

sanderjd · yesterday at 2:14 PM

Are there any harnesses that are attempting to optimize for using local models like this? Unsurprisingly, my naive attempts to integrate with harnesses designed for frontier models have gone poorly. But it seems like a harness that understands the capabilities and limitations better could perform significantly better.

datadrivenangel · yesterday at 12:02 AM

In my experience, once you get to ~30 GB of RAM for a model like Gemma4, the rest of the 128 GB of memory is simply nice to have. The speed and costs are what make it tough, though: it's slower and more expensive than the same model served on a big accelerator card, and it's going to be worse than a frontier model.

fennecfoxy · yesterday at 9:12 AM

>It's here, right now.

I mean, I've been forcing my good old 1080 Ti to run local models since a short while after LLaMA was first leaked.

But I wouldn't say "local models are here" in the same way as "year of the Linux desktop!111"

Until someone can just go out and buy some sort of "AI pod" that they can take home, plug in, and hit one button on a mobile app to select a model (or even just hide models behind various personas), I wouldn't say it's quite there yet.

It's important that the average consumer can do it. I think the limitations are: things are changing too quickly; RAM and compute components are exceedingly expensive right now; and we're still waiting on better controls/harnesses for this stuff to stop consumers not just from shooting themselves in the foot, but from blowing their foot clean off.

It would be interesting to see a Taalas-like chip in a product, although there's so much changing at the moment: diffusion-based models, Google's Turboquant (which, as someone who has almost always had to run quantized models, makes a lot of sense to me), and so on.

nsvd2 · yesterday at 4:12 PM

I run Gemma locally on a 3090; it's amazing how useful it is to be able to call out to ollama in a bash script or cron job.
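The pattern really is about this simple; a sketch in Python so it drops straight into cron, with the model tag, prompt and file path as placeholders (the ollama CLI prints the reply and exits when you pass the prompt as an argument):

    # Cron-friendly sketch: summarize whatever text file the job points at.
    # Assumes `ollama` is on PATH with some model already pulled locally.
    import pathlib, subprocess, sys

    MODEL = "gemma3:12b"   # placeholder model tag
    text = pathlib.Path(sys.argv[1]).read_text()

    result = subprocess.run(
        ["ollama", "run", MODEL, f"Summarize this in three bullet points:\n\n{text}"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)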

winocm · yesterday at 12:22 AM

Perhaps I am the odd one out here, but a small part of me wants to see what happens when you run a proprietary SOTA model on a laptop.

jimbokun · yesterday at 4:41 PM

Has anyone tried to calculate the break-even cost of buying a PC to run an LLM locally vs. the number of tokens you could get from an AI provider?
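A back-of-envelope version, with every constant an assumption to replace with your own numbers (and ignoring depreciation, resale value, your time, and the quality gap vs. a frontier model):

    # All figures are illustrative assumptions, not real quotes.
    hardware_cost   = 2500.0     # USD for the local box
    power_watts     = 300.0      # average draw while generating
    power_price     = 0.15       # USD per kWh
    local_tok_per_s = 30.0       # what you actually get, not the benchmark number
    api_price_per_m = 10.0       # USD per million output tokens from a provider

    local_cost_per_tok = power_watts / local_tok_per_s / 3_600_000 * power_price
    api_cost_per_tok   = api_price_per_m / 1_000_000

    break_even_tokens = hardware_cost / (api_cost_per_tok - local_cost_per_tok)
    print(f"break-even after ~{break_even_tokens / 1e6:.0f}M tokens")
    print(f"that is ~{break_even_tokens / local_tok_per_s / 86_400:.0f} days of nonstop generation")

With those made-up numbers it comes out to roughly 260M tokens, i.e. around 100 days of generating flat out, before the hardware pays for itself.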

dust1n · yesterday at 8:19 AM

Can you share how you use it to categorize trip photos?

antidamage · yesterday at 1:11 AM

This is my exact setup as well, and dear lord, Gemma is absolutely batshit insane. I'm trying to get a self-reflection and confidence loop going now, but it does feel like the bottleneck isn't the local resources, it's the limits of the training. Dedicated coding or dedicated real-world task models would be a good optimisation.
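For anyone wondering what such a loop can look like, a heavily simplified sketch (not my exact setup) against Ollama's chat endpoint, with the model tag, prompts and confidence threshold all as placeholders:

    # Sketch of a draft -> critique -> revise loop via Ollama's /api/chat.
    import json, re, urllib.request

    URL, MODEL = "http://localhost:11434/api/chat", "gemma3:12b"

    def chat(prompt: str) -> str:
        payload = {"model": MODEL, "stream": False,
                   "messages": [{"role": "user", "content": prompt}]}
        req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["message"]["content"]

    def answer_with_reflection(question: str, rounds: int = 3) -> str:
        draft = chat(question)
        for _ in range(rounds):
            critique = chat(f"Question: {question}\nDraft answer: {draft}\n"
                            "Start your reply with a confidence score 0-10, "
                            "then list concrete errors.")
            score = re.search(r"\d+", critique)
            if score and int(score.group()) >= 8:   # crude stopping rule
                break
            draft = chat(f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
                         "Rewrite the answer, fixing the listed issues.")
        return draft

The fiddly part in practice is the stopping rule: small models grade their own work erratically, which is exactly the training limit mentioned above.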

yieldcrv · yesterday at 2:24 AM

I need to see these proper harnesses

I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless: it tried to analyze a very small codebase before going fully agentic and ran out of context immediately.

I don't have time to tweak 1,000 permutations of settings just to re-prove that it's not as smart as Opus 4.6.

I need out-of-the-box multimodal behavior that's as simple as typing claude in the command line, and it's so not there yet.

but I'm open to seeing what people's workflows are
