Running local models is good now

941 points • by jfb • today at 2:36 PM • 407 comments

Comments

I think a lot of people just don't have specs like that, making it still painful.

I currently have a desktop with a 4060 Ti (16GB of VRAM). Most models I have tested that fit within that are not good enough for anything other than type completion (in regards to coding tasks).

I have been considering getting the 58GB Mac Mini, but that is a decent amount of money to spend without confirmation on a) how fast it is and b) whether it will work for well-defined tasks.

frollogaston • today at 6:01 PM

"Good" refers to the speed and not the quality. There's so much hype about Macs being great for LLMs, but nobody seems to be seriously using them for that because the open models are unfortunately so far behind.

throwarayes • today at 4:33 PM

I am happy to pay OpenAI for a cheaper model a few generations behind. But they deprecate models aggressively. They push you to bigger and smarter models, when 95% of my work doesn’t need it.

I’d love it if model providers just let old models run and let us pay less, but the deprecation makes me want to look into local models.

ta-run • today at 5:55 PM

Not related, but I can't seem to get my copilot-cli (office is an MS shop) to use qwen3.5:27b on ollama for some odd reason.

After the recent changes to usage, I've spent an annoyingly large number of hours trying to get this to work.

blobbers • today at 6:35 PM

Have you tried optimizing for MLX? It seems like a waste to have neural cores and not use them.

I've often wondered why there's so much hype around Apple's neural cores when 99% of software doesn't use them.

fridder • today at 4:05 PM

Is there a local harness designed around the local model use case that is Claude Code-like? Opencode has been problematic at times, and pi works for one-offs for me but not for back-and-forth conversations with the LLM. Considering I only use Qwen or Gemma models, I'm close to just writing my own at this point.

WASDx • today at 5:30 PM

Looking at some benchmarks, the latest ~30B Gemma/Qwen models score similarly to Claude or GPT versions that were released just one year earlier. That's crazy progress. I can't imagine how it will be in a few years.

k__ • today at 5:19 PM

I tried some smaller Gemma4 and Qwen3.6 quants on my MBA with M5/16GB and had like 20-60 tokens per second. At 60 it felt pretty okay and that hardware is on the lower end.

I'd assume a Mac with 32-64GB memory would get some reasonable results.

anax32 • today at 3:19 PM

I've just made a milestone on my project, moving away from AWS (budget) to self-hosted and the local models are so much faster than in the past. Beyond LLMs, having embeddings, image, video, audio gen available is crazy.

Running locally is the bar; it's hard to make these things a service which scales.

nikagrawal121 • today at 7:08 PM

I tried it for the legal AI application that I'm building and it was able to do the majority of the tasks. I used gemma4:26B.

prlin • today at 4:26 PM

If you wanted to do some research or learn about post-training and agent harnesses, are these local models a good option? What hardware is recommended, or is it easiest to go with a Mac Studio with 64GB+ RAM?

wrxd • today at 4:34 PM

I wonder how much local models hallucinate. I am getting almost daily an "Honest answers: I made that up." reply from Claude Opus when I challenge some silly thing it's trying to do.

malkosta • today at 4:32 PM

The problem with Qwen is that it just can't edit files reliably. I had to hack Pi all over to reduce the pain, but it's still far from perfect... does Gemma 4 struggle with this?

stared • today at 3:38 PM

I really recommend Qwen3.6 27B.

I ran some tests, and its 8-bit version runs at 30 tok/s when using llama.cpp with MTP on a MacBook with an M5 Max. I have 128 GB, but 64 GB is plenty. https://github.com/stared/benching-local-llms-on-apple-silic...

On benchmarks, it performs at more or less the level of mid-to-late 2025 SotA.

bthornbury • today at 5:50 PM

The qwopus 27b model is good for grunt-work-style tasks, even across multiple files: piping a bunch of things through, small factoring changes, stuff that just takes time to type out.

I wouldn't rely on it for large stuff like codex though. I haven't tried out deepseek/kimi, if we could run those locally it would be great.

0xbadcafebee • today at 7:53 PM

Local models have been good for a while. But this being the HN echo chamber, people here think that local models can only be used for coding, and are expecting Opus 4.8 on their iPhone. Turns out AI can be used for things other than just coding. Even tiny models (<4B parameters) can do tons of useful things on local devices. Search, index, summarization, troubleshooting, crafting documents/formatting, image analysis, transcription, object identification, robot navigation, text-to-speech, speech-to-text, browser/window control, MCP/tool calls, and much more.

Larger models just do more complex reasoning. But if you want them to be really good, you need a beefy Mac. They have the best combination of memory bandwidth and RAM to allow medium-sized models to run at speed. GPUs have less memory but more bandwidth, and AMD iGPUs have more memory but less bandwidth. The Mac is the best compromise on the market today.

Once you do have a beefy Mac, you want to run a dense model. This gives you the best possible result with the system you have. You can go MoE for faster results, use cutting-edge inference techniques, parameter tweaks, etc. But a basic dense model (at Q6 quant) on a big-ass Mac will serve 90% of your coding needs.

ridruejo • today at 5:38 PM

Local models are one of the main drivers for our installer / Desktop app for OpenClaw https://holaclaw.ai (disclaimer: I am one of the founders). The smaller models are really only suitable for the most basic tasks, but if you have 32GB-64GB you can get real work done (i.e. complex web workflows) without third-party hosted models.

daniban • today at 4:13 PM

With Apple silicon and now the RTX Spark, there are real discussions about whether local AI is the future. The only problem is that Western open source models are so far behind. I genuinely feel there's a push to fix this. Gemma is getting more frequent releases and Nvidia is quietly creating very cool small models. I hope both the hardware and models catch up and local really does emerge.

ibizaman • today at 3:40 PM

Tangential, but reading on mobile, the font sizes in the code snippets are all over the place. I actually have the same issue on my blog. Does anyone know why?

osigurdson • today at 6:13 PM

Running AI on timesharing mainframes does seem like an odd final state for the world.

Computer0 • today at 10:29 PM

I have 16GB VRAM and 96GB RAM on all my computers and I do enjoy local models. I would not use them for coding; I have experimented with it, but it is largely a waste of time on my hardware. I love local chat with different models, however: used this way, it is much easier to experiment with the largest models near the limit of your hardware, and I do find it somewhat useful on the airplane. I have also used local models for data classification tasks, letting them run over the weekend, and the results were acceptable.

xienze • today at 3:35 PM

The big caveat here is that these local models require you to invest some time tweaking your harness, AGENTS.md, and skills in order to get things roughly to the level you'd expect. But something like Qwen3.6-27B with web search capabilities and a good set of skills really is impressive! Especially considering that you can go wild and not worry about token costs.

The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes, it's a lot of $$$ upfront, but it's very much unknown when hardware prices are going to come back to reality. There are a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumption that a Claude subscription over three years' time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.
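To put rough numbers on the 16GB point, here is a minimal sketch that estimates the size of the weights alone at a given quantization and what is left over on the card. The 27B parameter count is borrowed from the Qwen3.6-27B mention above; the function names are made up for illustration, and real quant formats usually store slightly more than the nominal bits per weight, so treat the results as optimistic.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def headroom_gb(vram_gb: float, params_billions: float, bits_per_weight: float) -> float:
    """VRAM left over for KV cache, activations, and runtime overhead."""
    return vram_gb - weight_gb(params_billions, bits_per_weight)


# Assumption: a 27B dense model on a 16 GB card, nominal bit widths only.
for bits in (4, 6, 8):
    left = headroom_gb(16, 27, bits)
    status = "fits" if left > 0 else "does not fit"
    print(f"27B at {bits}-bit: {weight_gb(27, bits):.1f} GB of weights, "
          f"{left:+.1f} GB headroom on 16 GB ({status})")
```

At a nominal 4 bits the weights take 13.5 GB, leaving about 2.5 GB for everything else, which is consistent with needing a heavily quantized cache to make it work at all.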

fl4regun • today at 4:21 PM

In my experience, with a system of 32GB RAM and 24GB VRAM, no, they aren't that good.

wasimxyz • today at 3:47 PM

https://canirun.ai

aleksandrm • today at 9:00 PM

Clickbait title, because running local models is still not good now.

drchaim • today at 4:32 PM

I really want to try local models, but I don't have the hardware yet. I'm probably the only one here still using a 2020 Mac Mini M1 with 8GB. :/

atulmy • today at 6:48 PM

Exact reason I'm building csuite.so, do check it out and let me know if you need early access!

matrix12 • today at 7:58 PM

gemma:12b at 75% of frontier? Yeah....

Mr_Eri_Atlov • today at 8:07 PM

I think this is a pivotal moment for LLMs.

Gemma 4 and Qwen3.6 27B aren't perfect, yet they are such a step forward from the previous generation that it's both feasible to get stuff done locally with patience and very likely that future releases will subvert cloud capabilities entirely.

Plus, they have definite reliability advantages over cloud models that can be wiped out by a government order or lobotomized to handle traffic surges.

jmyeet • today at 7:55 PM

It's not "good". A more accurate description would be "sometimes useful and not far from being good". The author is using pretty small models. There have been a lot of improvements that scale in any case (eg MTP) but ultimately this is still hardware limited by 3 factors:

1. Memory bandwidth;

2. VRAM size, which limits the size of a model you can use effectively. Yes, you can swap, but then you're taking a performance hit;

3. Raw FLOPS, including quantization.

Apple here is interesting because they have a shared memory model and you can buy Macs currently with up to 128GB of RAM (previously 256/512GB on Mac Studios, both discontinued). New M5 Mac Studios are expected in Q3 but that's not guaranteed. It may take until next year.

Depending on the chip, Macs top out at ~900GB/s. A 5090 or 6000 Pro has 1800GB/s. A B100 is at like 3.2TB/s. A 5090 has, depending on how you count, 5-7x the FLOPS of an M5 Pro, so a 5090 is still better than any current Max... except for the 32GB limit.
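As a back-of-the-envelope illustration of why bandwidth matters so much, here is a minimal sketch of the usual bandwidth-bound ceiling for decode speed. It assumes a dense model where every generated token reads all the weights from memory once, and it ignores KV-cache reads, compute limits, and speculative techniques such as MTP. The bandwidth figures are the ones quoted above; the 27B 8-bit model and the function name are illustrative choices.

```python
def max_tokens_per_second(bandwidth_gb_s: float, params_billions: float,
                          bits_per_weight: float) -> float:
    """Bandwidth-bound ceiling: bytes available per second / bytes read per token."""
    model_gb = params_billions * bits_per_weight / 8  # size of the weights in GB
    return bandwidth_gb_s / model_gb


# Assumption: dense 27B model at 8 bits per weight (about 27 GB of weights).
for name, bandwidth in [("Mac at ~900 GB/s", 900.0), ("5090 at 1800 GB/s", 1800.0)]:
    ceiling = max_tokens_per_second(bandwidth, 27, 8)
    print(f"{name}: roughly {ceiling:.0f} tok/s at most")
```

Under these assumptions the ceiling is roughly 33 tok/s at 900GB/s and 67 tok/s at 1800GB/s. Real throughput is normally lower than the ceiling unless something like speculative decoding is in play.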

Nvidia aggressively segments the market by limiting VRAM. The RTX 6000 Pro is basically a 5090 with slightly more CUDA cores and 96GB of VRAM instead of 32GB for $10-11k instead of $3k.

So let's project this into the future a little. The M6 Ultra/Max may well be 1TB+/s memory bandwidth with much higher FLOPS and thus actually be competitive for larger models. A 6090 in the current market will probably still have 32GB of VRAM if I had to guess. Maybe it goes up to 48GB.

But anyway I think we're only 2-3 years away from sub-$5000 hardware that does 100-300+tok/s on models larger than 31B. And that's going to be a game changer.

ZionBoggan • today at 4:55 PM

This is actually a really insightful post!

jingw222 • today at 5:08 PM

open source must win

jauntywundrkind • today at 7:27 PM

i'd love to get to a point where big models can launch subagents that are fast and local. there's a lot of focus on token rate, but just as much, the way cloud providers have other latencies & processing styles not optimized for latency (running large batches all at once), and i think local might have some real wins. Gemma 4 seems already on the right track. lfm2.5-8b-a1b (https://www.liquid.ai/blog/lfm2-5-8b-a1b) and DiffusionGemma seem to both be very high token rate. but getting that latency down, so that a series of tool calls can happen faster, would be a real win. I think especially with good prompting that becomes much more possible.

One caveat, I have absolutely no patience for a lot of subagent systems, like opencode, where the subagent is walled off and incommunicado. My subagents really should be their own session, that i can deal with as I please, with some MessageChannel like offerings/tools available to them. Ideally with modes where messages auto-flow in and out, and modes where I can be a gate-monitor. https://developer.mozilla.org/en-US/docs/Web/API/MessageChan...

Not really super related but MCP has been working on Events for a while. That ability to respond fast would be great. https://github.com/modelcontextprotocol/experimental-ext-tri...

Asking local to be fast feels like an obvious folly, but given how much better small models have got, and seeing these models tune themselves for speed: I want to hope!

monegator • today at 4:08 PM

I've been trying local models for the boring stuff you might be thinking about: writing small docs.

So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.

The problem is: i'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, i asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s) so i'm pretty limited in what i can use. And this is running alongside the usual suspects: Web browser hogging 1.5GB, MPLABX hogging 3, windows taking at least 5 just to sit idle, thermal throttled to 1GHz ... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.

So, what i found, what i found... What i found is that i need AT LEAST 16k of context window, otherwise they will halt when i pass a small C file for analysis. And coding models will shit the bed with 4k. But we all know that, context size is King.

I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get a useful answer. I was hoping to use it as a better warning system for some languages, but i fear i need muuuch more context size, because i tried to feed a file that had a function with an endless loop:

At 4k context it almost shit the bed if i gave it just the offending function, then told it where to look. At 16k context, with the whole file, it needed some guidance as to what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second-guessing itself for another 20 minutes on the same unrelated thing before giving the output. The fix it gave was wrong, but the semantics were correct. Good enough. Maybe it will be faster if i don't ask for a fix (which i didn't, i just asked it to look for a specific issue).

Wish i had 3 times the RAM so i can see what happens with more context.
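For a sense of how RAM scales with context, here is a minimal sketch of the standard KV-cache size estimate for plain attention: keys and values are stored per layer and per token. The layer count, KV head count, and head size below are hypothetical and are not the real configuration of any Qwen or Gemma model; models with sliding-window or hybrid attention will come out smaller.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size in GiB: 2 (keys and values) x layers x KV heads x head dim x tokens."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30


# Hypothetical architecture, chosen only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (4_096, 16_384, 49_152):
    size = kv_cache_gib(ctx, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{ctx:>6} tokens: {size:.2f} GiB of cache at 16-bit")
```

With these made-up numbers the cache grows linearly with context, from 0.5 GiB at 4k to 2 GiB at 16k, on top of the weights themselves. Quantizing the cache to 8 bits (bytes_per_value=1.0) halves it.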

Then i gave it the task to analyze a C file to make an API document. It took half an hour, but then i had a good starting point, which i had to keep changing because it would confuse commands with IDs and things like that.

This was the Qwen 3.5 9B model.

I then tested Gemma 4, being impressed at the tokens per second it gives on my Pixel 8A. Same tasks: same issues with short context, with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of qwen.

In producing documentation, instead, it was much faster, and it never hallucinated data. Good. In 15 minutes i had everything done.

Not bad for stuff running on a business laptop, while doing actual work.

Tomorrow i will try Qwen 3.6, let's see how it goes..

holoduke • today at 5:40 PM

Good? My MacBook M3 with 36GB locked up after Gemma4 filled all its memory. A bit useful, yes. But it eats all resources. For local models to be useful we need at least 128GB of system memory and 512GB of video memory. Plus 8 times the compute of a single 5090/H200.
