I don't know about good, I use a lot of local models and they're still pretty painful to run locally
You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow
You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes
You need a lot of memory to run these well. Quantization makes tool calling weaker, yet most people run 4-bit quants and then wonder why it kinda sucks: it's because you've essentially lobotomized the model. (I recommend unsloth quants, 6-bit for MoEs and 5-bit for dense.)
So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs
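To put rough numbers on the memory and bandwidth part, here is a back-of-the-envelope sketch. It assumes weights dominate memory, that every active weight is read once per generated token, and it ignores KV cache reads and runtime overhead, so the result is a ceiling rather than a prediction. The 400 GB/s bandwidth figure is a made-up placeholder, not a measurement of any machine mentioned here.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes).

    params_b is the parameter count in billions.
    """
    return params_b * bits_per_weight / 8


def decode_ceiling_tok_s(active_params_b: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed: bandwidth divided by bytes read per token.

    Assumes each active weight is streamed from memory once per token.
    """
    return bandwidth_gb_s / weights_gb(active_params_b, bits_per_weight)


BANDWIDTH_GB_S = 400  # hypothetical memory bandwidth, for illustration only

# Dense 27B at 5-bit: all 27B weights are read for every token.
print(weights_gb(27, 5))                              # memory to hold the weights
print(decode_ceiling_tok_s(27, 5, BANDWIDTH_GB_S))    # decode ceiling

# MoE 35B with ~3B active at 6-bit: all 35B must fit, only ~3B are read per token.
print(weights_gb(35, 6))
print(decode_ceiling_tok_s(3, 6, BANDWIDTH_GB_S))
```

This is why the MoE models feel fast while still needing about as much memory as the dense ones: the footprint scales with total parameters, the decode speed with active parameters.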
On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.
So are they good? not really. Do they work? yes
edit: just wanna clarify - i think open models are the future, i think they're super important, i'm contributing constantly to the ecosystem - i think people should play around with these models, i think people should use `pi` and learn how it all works - but don't download a model expecting it to be good out of the box. You will have to tune and configure a lot of stuff before it can replace the "coding agent" that most people are using these models for
Just to piggyback onto this comment; has anyone tried running multiple of these in conjunction? For example, having a Python script that has one of these orchestrate others, and offloads certain tasks to better/more powerful models, or even cloud models?
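One way to picture the setup being asked about is a simple escalation loop: try the cheapest model first and only hand the task to a bigger (or cloud) model when the answer fails a check. This is a minimal sketch, not a tested orchestration framework. The model functions are hypothetical stubs standing in for real API calls, and `accept` is whatever validation you trust (tests passing, output parsing, and so on).

```python
from typing import Callable, List, Optional

# A "model" here is just any function from a task description to an answer.
Model = Callable[[str], str]


def run_with_escalation(task: str, models: List[Model],
                        accept: Callable[[str], bool]) -> Optional[str]:
    """Try each model in order (cheapest first); return the first accepted answer."""
    for model in models:
        answer = model(task)
        if accept(answer):
            return answer
    return None  # nothing passed the check


# Hypothetical stand-ins for a small local model and a stronger fallback.
def small_local_model(task: str) -> str:
    return ""  # pretend the small model produced nothing usable


def bigger_model(task: str) -> str:
    return f"done: {task}"


result = run_with_escalation(
    "rename variable",
    [small_local_model, bigger_model],
    accept=lambda answer: answer.startswith("done"),
)
print(result)
```

The hard part in practice is the `accept` check, since a weak check just forwards the small model's mistakes.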
This is basically my experience as well. I have a moderately recent but high spec desktop (Radeon 6900 XT with 16 GB VRAM, Ryzen 9 7900X 12-core, 64 GB system RAM), and I tried out some recommended models with ollama a month or two ago. Anything not geared specifically towards coding seemed to struggle to actually make tool calls, and would just state the actions it would take instead (and trying to get help from them to explain what I needed to configure to change that behavior was useless; qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system), and the models intended for coding were barely thinking faster than I could type (if they had any ability to show thinking at all).
The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.
IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill kills perf.
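For a sense of why the KV cache eats so much VRAM at long context, here is the usual size estimate as a sketch. It assumes plain attention where every layer caches full keys and values; models with sliding-window or hybrid attention will need less. The architecture numbers in the example are hypothetical and not taken from any specific model in this thread.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes).

    The leading 2 accounts for storing one key and one value tensor per layer.
    bytes_per_value=2 corresponds to a 16-bit cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# Hypothetical architecture: 48 layers, 8 KV heads of dimension 128.
print(kv_cache_gb(48, 8, 128, context_tokens=20_000))
print(kv_cache_gb(48, 8, 128, context_tokens=200_000))
```

The cache grows linearly with context length, so a 10x longer context means a 10x larger cache on top of the weights.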
If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.
Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.
Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.
Gemma 4 is particularly good at pipeline/automation tasks.
It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.
Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)
But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card they would run significantly faster and still fit in memory.
I'd recommend for anyone getting into local models to focus on memory bandwidth over raw RAM capacity. Models under 100B parameters are now sufficient and hugely useful for automation.
I agree that for coding/creation use cases, there's still not a compelling argument for local models.
But if you want to e.g. scan a list of stocks and interpret/high-pass filter the news, interpret logs, or interpret screenshots, the local models are more than sufficient already.
> You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes
This is sadly also my experience. I wish we had some MoE models with a higher ratio of active to total parameters. My experience is that the newer MoE models that can run on a 64GB laptop have too few active parameters to be useful outside narrower, specific tasks. Mixtral 8x7b was a roughly 13b active parameter (about 47b total) MoE model a few years ago and was probably the best model one could run in that range for some time, but it is too old now.
I have been using the qwen 27b and it is great, but running a dense model like this on a MacBook is a bit suboptimal, and I wish I could run something faster than 15 tok/s.
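The ratio in question is easy to compute. A small sketch, reading the approximate active and total counts straight off the model names used in this thread ("A3B" meaning about 3B active, "A4B" about 4B active); these are rounded label values, not exact parameter counts. For comparison, Mixtral 8x7b sits at roughly a quarter.

```python
def active_ratio(active_b: float, total_b: float) -> float:
    """Fraction of the model's parameters used for each token."""
    return active_b / total_b


# (active, total) in billions, taken from the model names (approximate).
examples = {
    "qwen 35B A3B": (3, 35),
    "gemma 26B A4B": (4, 26),
}

for name, (active_b, total_b) in examples.items():
    ratio = active_ratio(active_b, total_b)
    print(f"{name}: {ratio:.0%} of weights active per token")
```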
I think you’re spot on. In my experience people mistake a model's ability to solve some benchmark for a sign of its usefulness. Token throughput is often just as important in my personal usage. I am excited for more diffusion models to see how progress happens there.
To be honest even the cloud models are a hot mess at times. This week I’ve spent more time rejecting code from OpenAI models than I have approving it.
In fact it really feels like OpenAI models have taken a nose dive this week compared with Claude. At least for my specific workloads (these things are so variable it’s like trying to compare Google results…)
I've been using unsloth/gemma-4-31B-it-qat-GGUF daily for various small parsing and programming tasks using opencode and llama-server's front end. The past couple of weeks have made a big difference: Google released the QAT variant and llama.cpp got support for MTP, which means it is now possible to get 60-80 tok/s with an RTX 4090. The model fits in VRAM comfortably enough to keep it loaded even while browsing and having multiple programs open.
Those dense models are pretty fast with MTP now. 40-70 tok/s depending on your machine, which is faster than cloud models (although not as smart, obviously).
Depends on what you mean by "local". On your Macbook, large dense models like Qwen 3.6 27B will be slow, sure. On a local workstation with a dedicated RTX card you can get > 100 tps, which is more than good enough to work with it, and faster than cloud models in many cases.
A median laptop is no bueno for running a reliable model (which would be qwen 27b, as per my reading here and on r/localllama). Powerful Macs may be prevalent in certain areas of the world, but in the rest of the world personal machines aren't always that powerful.
Kimi 2.6 or 2.8 is what we are playing with locally. They need 512GB to 1TB to run with full capabilities so that's not exactly "desktop"
Our GPU compute server cost $110k.
> On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.
Laptop?
OK, I've made that mistake before. I understand modern laptops are powerful, but nobody wanting to do serious AI/ML work should be using a laptop for anything other than SSH or similar low-performance access into a proper system.
Years ago I fried two laptops just doing finite element analysis work running 18+ hours per day. It was one of those "I'm giving you all she's got, Captain!" workloads. They fried, even with powerful fans cooling them. I should have known better. Such workloads belong on purpose built systems.
I largely don't disagree with you but come to a different conclusion. I have two systems:
1) a "programming desktop" with a $500 upper mid range Ryzen (idr exact), 8GB VRAM Radeon card I bought solely for RuneScape, and 64GB ram
2) a maxed out Alienware 16 Area51, so it's a 5090 with 24GB vram and 64GB system ram. I bought it for gaming, of course.
I run qwen 3.6 35B A3B Q6 with 200k context window. I compare this to Claude pro max or whatever that I use at work.
The main difference between the machines is that the one with the RuneScape gpu does 10 TPS while the Alienware does 30-40tps. Both are fine though the 30-40tps is obviously a lot snappier.
I find with both models that:
- they do really well at "be a 30GB zip file of reddit and stackoverflow answers"
- they do really well at point fixing random bullshit errors that would otherwise waste my time (this is related to above of course)
- they do quite well at, given a pretty good specification of what you want, figuring it out, even if you've specified several steps needed
- they both cannot really be given a large ish task and left to just drive it on their own
The main difference between the two is with that last one: Claude is somewhat better at figuring SOMETHING out, but if Claude is having to figure it out, it's probably because I don't know what I want, and it's very likely to not make a sane choice and will generally produce slop given even the slightest amount of leash.
I've also found that the boundary between "well specified small to medium thing" and "idk just do thing and figure it out" is the difference between you keeping control of the code and losing control. There's an "escape velocity" of AI use that, when you hit it, you're doomed to slop forever. (Or you have to deorbit... enjoy that). And while claude might have slightly higher velocity allowed while remaining suborbital, it's very diminishing returns.
So, are these models "worse" than Claude? Yeah. Am I looking forward to continued improvements? Yeah. But I now also have no desire to pay anthropic any amount of money, which has the nice side effect that i won't be helping them end up with so much money that they can distort our democracy.
What counts as a lot of memory? What could someone do with 16 GB of RAM?
4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it
They are good if you were clever enough to buy a powerful enough rig before memory prices went up. For everyone else I say just wait. M1 Ultra 128GB and higher is sufficient to run gemma4:31b-mlx or qwen3.6:35b-mlx with subagents. It’s only slow if you don’t know how to plan your work effectively.
maybe painful if you are using it like a chatbot, where you are sitting there waiting for a response, vs ambient ai like automatically classifying your family pics and discarding random things like a pic of a parking floor number.
i use it for use cases like the latter and they are fine.
They are still terrible at tool usage, which loses 99% of the effectiveness of the agent. I've had to concede and use paid frontier models that can use tools, or it's not worth using agents....copy...paste....copy....paste....
I wonder if it is better to have a machine somewhere running a model for you maybe shared with a few others. I could probably justify a M6 Mac Studio with hopefully 256gb RAM and have a few people all with access to one agreed upon model. I think maybe laptops are too warm and clunky for this.