Qwen 3.6 27B is the sweet spot for local development

1136 points • by stared • yesterday at 5:05 PM • 702 comments • view on HN

Comments

Been running it on a 9950x3D with 96GB and a 4090. Speedwise it is fine. But while not completely useless, for software development it is unsurprisingly a dramatic downgrade from the Opus I use as my daily driver.

blagui • yesterday at 10:24 PM

How can you do dev in 2026 with 64k context and without sub agents?

The benchmark seemed fine until I saw that.

If you use sub agents, they will overwrite the cache and each request will trigger full reprocessing. Have fun with that, as it will crash the t/s metrics on each prefill. On top of that, the 64k maximum, which covers input plus output, is a major blocker.

If you push the context higher and add parallel slots the requirements will be far higher and the numbers less shiny.
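
To put rough numbers on that last point: for a standard full-attention transformer, the KV cache grows linearly with the total number of cached tokens, so more context or more parallel slots costs memory in proportion. A minimal sketch; the layer count, KV-head count and head size below are hypothetical placeholders, not Qwen 3.6's actual architecture, and models with hybrid or linear-attention layers would need less:

```python
def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GiB for full attention: one K and one V vector per layer per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_tokens * per_token_bytes / 2**30

# Hypothetical architecture values, for illustration only.
arch = dict(n_layers=64, n_kv_heads=8, head_dim=128)

single_64k = kv_cache_gib(64 * 1024, **arch)             # one 64k-token context
four_slots_128k = kv_cache_gib(4 * 128 * 1024, **arch)   # four slots of 128k tokens each

print(f"{single_64k:.0f} GiB vs {four_slots_128k:.0f} GiB")
```

With these made-up values at 16-bit precision, going from one 64k context to four 128k slots multiplies the cache by eight; quantizing the KV cache to 8 bits (`bytes_per_elem=1`) would halve both figures.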

diseasedyak • yesterday at 7:43 PM

I have 24GB of VRAM (via an RTX 4090) and run Qwen3.6-35b:iq4. It uses importance-aware quantization, so it isn't nearly as dumb as it sounds, and it fits the 35b into 18 GB so you have some left over. So far I've had no issues, other than it taking a while for things like image gen; I found that if you want that done with any alacrity, just have a cloud model do it.

For anything else local, including writing some automation scripts and such, it works great.
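
The 18 GB figure is roughly what back-of-the-envelope arithmetic gives. A minimal sketch, assuming about 4.25 bits per weight for a 4-bit importance-aware quant (that number is an assumption; real GGUF files mix several tensor types and add metadata, so actual sizes differ somewhat):

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# 35e9 parameters at an assumed ~4.25 bits per weight
size_gb = quantized_weight_gb(35e9, 4.25)
print(f"{size_gb:.1f} GB")  # 18.6 GB
```

Whatever is left of the 24 GB after the weights has to hold the KV cache and compute buffers, which is why the headroom matters.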

meta-level • today at 9:30 AM

Why does everyone imply you need a $10k laptop that starts burning when you run Qwen 3.6? Get any other system with enough VRAM for a third of the price. Framework Desktop (Strix Halo 128GB) still costs under 4k nowadays and is nearly silent even at 100% GPU + CPU. (It also only gets slightly 'warm', but with a desktop you don't care anyway, I guess.)

markdog12 • yesterday at 6:14 PM

I've tested it extensively for actual local development for my job, and hard disagree here. It's a waste of time to use it. Wish it were not true.

christoff12 • yesterday at 9:05 PM

I just burned 20 minutes because I wanted to play hex minesweeper: https://hexabomb.pgpln.app

Source: https://chatgpt.com/share/6a42dd8a-4e28-83e8-9ef7-6ba56d665c...

kristopolous • today at 1:21 AM

Help me improve local model performance with petsitter!

It basically exploits the fact that time can be traded for intelligence with local models.

https://github.com/day50-dev/Petsitter

drnick1 • yesterday at 10:49 PM

Has anyone managed to cleanly integrate web search into local models (run with llama.cpp)? The biggest limitation of the class of models that fit into one or two consumer GPUs is that they lack world knowledge, but presumably this can be remedied by giving them access to the Internet.
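
One common pattern is to expose search as a tool and let the client run it. A minimal sketch, assuming an OpenAI-style chat completions API with tool calling (llama.cpp's `llama-server` offers an OpenAI-compatible endpoint); the `web_search` function and its stub backend are hypothetical, and nothing here touches the network:

```python
import json

# Tool schema in the OpenAI function-calling format.
WEB_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return short result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def web_search(query: str) -> list[str]:
    """Hypothetical backend; replace with a real search API of your choice."""
    return [f"(stub result for: {query})"]

def handle_tool_calls(assistant_message: dict) -> list[dict]:
    """Run each requested tool call and build the 'tool' messages to send back."""
    replies = []
    for call in assistant_message.get("tool_calls") or []:
        if call["function"]["name"] != "web_search":
            continue  # ignore tools this sketch does not implement
        args = json.loads(call["function"]["arguments"])
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(web_search(args["query"])),
        })
    return replies
```

The schema would go in the `tools` field of the request, and the returned messages get appended to the conversation before asking the model again. Whether a small model calls the tool reliably is a separate question.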

blobbers • yesterday at 5:29 PM

How does llama.cpp use the GPU efficiently as opposed to MLX?

Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?

TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.

If I can generate voice at the same time as video, that would be useful.

letmetweakit • today at 10:57 AM

Any chance of running this on an RTX 3090 and 64GB of regular RAM with a decent context size?

recursivedoubts • yesterday at 9:09 PM

I would like to offer someone the next openclaw: a GUI for the mac that allows people to manage and install local models with a single click, provides GUI tools for tweaking important aspects of them, and also provides a good command line interface to those models.

drillsteps5 • yesterday at 7:46 PM

I honestly don't get the hostility against local models in this thread (and in some other threads recently).

I haven't seen anyone make an argument that they are as good as SotA (OpenAI, Anthropic). It's just that they are approaching a state where they are "as good" for some _limited_ set of use cases. That will allow us to resolve the 2 primary issues with these SotA models: privacy and vendor lock-in. Plus, they're very useful for educational purposes: you get to explore what things look like under the hood, play with various models and tools, maybe put something simple together yourself.

You've got a MacBook - great. You've got a gaming rig with a decent GPU - great (set it up as a dedicated server that you connect to through simple REST).

What exactly is wrong with any of that?

amlord • today at 11:13 AM

Tried looking at it, but needs a much beefier machine than I have RN.

Hopefully we're looking at a future where local models become more & more realistic to use for reducing remote API spend.

narrator • yesterday at 7:28 PM

In hindsight, the 512GB Mac for about $10k was a total steal, given that to run GLM 5.2 you need 4x H100 to get the necessary amount of VRAM. Yeah, the H100 is 2 to 8 times faster, but it's $20k a month to rent a 4xH100 VPS.

aichi • today at 2:55 PM

What model fits on a 36GB RAM Mac?

cdnsteve • yesterday at 7:59 PM

Check out details on what this runs on for local AI here: https://tokenstead.ai/models/qwen3-6-27b

kopirgan • today at 3:40 AM

Lost count of the number of times I've read this or similar:

> For me it’s the first local model that actually makes sense as a general intelligence.

v3ss0n • yesterday at 9:05 PM

3.5 122B is much better. 27B is bad at long context and Svelte.

macwhisperer • yesterday at 11:18 PM

hi guys... I run specialized quants on my 24gb air (I specialize in 3-bit quants that punch above their weight). Try out my version of 3.6-27b, I think you'll be impressed: https://huggingface.co/macwhisperer/Qwen3.6-27B-SuperDense

max8539 • yesterday at 10:15 PM

I run this model on a 48 GB MacBook Pro when offline. It performs its tasks, but of course it’s slower than Claude or Codex.

taf2 • today at 1:10 AM

Best way to make your M series MacBook Pro feel like a good old-fashioned Intel MacBook Pro: run a local model.

senorqa • today at 12:45 AM

On an AMD R9700, I'm getting ~90 t/s with the 35b MTP variant and ~40 t/s with the dense 27b MTP.

cloudengineer94 • yesterday at 9:52 PM

I'm using Qwen and Gemma 4 locally and it's pretty good stuff, not frontier level but gets the job done.

hoppp • yesterday at 9:34 PM

It's feasible, but that laptop is not available to most devs.

I do have access to a 64 GB RAM Mac mini, but most people don't.

alansaber • yesterday at 8:23 PM

Is qwen finetuned/RL'd on any agent harness? Or does it just work well enough off the bat with opencode?

anonym29 • yesterday at 5:35 PM

Strix Halo user here. While Qwen 3.6 27B exhibits remarkable intelligence density, I will still take unsloth's dynamic IQ2_XXS of Minimax M2.7 over Q8_0 Qwen 3.6 27B any day of the week, and this isn't just because of generation speed either. I wrote my own custom harness, and I get hallucinated tool call parameters and bizarre invocations with Q3.6 27B even at Q8_0, but no issues with the IQ2_XXS of M2.7.

macwhisperer • yesterday at 11:21 PM

also for those with only 16gb, try this model: https://huggingface.co/macwhisperer/Gemma4-12B-SuperDense it's exceptional!

agenticup • today at 6:13 AM

qwen 3.6 27b and qwen 35b a3b work like magic. If we get dpark speculative decoding versions of these models, it will further improve the throughput.

felooboolooomba • yesterday at 8:19 PM

What's the minimum requirement for an Nvidia card to run it? For, let's say, 10 t/s.

zerolines • yesterday at 8:54 PM

Yup, been rocking the Qwen3.6-35B-A3B-MTP-GGUF locally at 88 tk/s, it's amazing.

konart • today at 6:12 AM

>Real work

This part should have featured something about real work. But instead it features a paragraph about one-shot bs that creates "something".

Unless your work is to create thousands of WordPress templates to sell, this is not "real work".

Give it a repository (any kind of OSS project will do as an example) and a GitHub issue requesting a new feature or describing a confirmed bug. (You can and probably should write a prompt for the LLM though, don't just provide the issue itself.)

And then watch it go.

And then judge the result and its quality.

Sorry, but from my experience 27B is just useless. You do get a result and sometimes it does work, but most of the time it is not even at junior dev level. And it takes a lot of time to do the thing, unless you have an extremely expensive machine.

devin • yesterday at 8:24 PM

If I have 10k to spend, what should I buy for the best local model experience?

fossheart • today at 2:58 AM

> I recommend llama.cpp - a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly - I would recommend against using that on ethical grounds.

> https://sleepingrobots.com/dreams/stop-using-ollama/

I faced roadblocks while integrating with openclaw using ollama (I was trying to experiment with `qwen3-vl:2b`). I traced the issue back to openclaw at the time; I didn't even consider investigating ollama.

I've attached a Threads post where I talk with Meta AI to expand on both points (why to use llama.cpp rather than ollama, and my take on why things are the way they are, i.e. commercial gains):

https://www.threads.com/@riojos/post/DaMXIs4k4w8

mannyv • yesterday at 7:08 PM

FYI token speed is somewhat irrelevant for agentic development. You let it run, then you come back. The whole point is that it's asynchronous. If it takes 4 hours, 8 hours, 16 hours...who cares?

happyash1 • today at 5:00 AM

Qwen is such a good model.

LoganDark • today at 9:51 AM

I see OpenCode mentioned in the article, and I would strongly warn against using it for local development because it disrespects caching (the content of the first turn / system prompt is NOT stable). I use Pi which works much better.
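
To illustrate why an unstable first turn hurts: prompt caching generally reuses work only for the longest shared token prefix, so a change near the start forces almost everything after it to be reprocessed. A minimal sketch with made-up token IDs (not any particular server's implementation):

```python
def reusable_prefix_len(cached: list[int], new: list[int]) -> int:
    """Number of leading tokens shared by the cached prompt and the new one."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5, 6, 7, 8]

stable = cached + [9, 10]                   # same prefix, new turn appended
unstable = [1, 99] + cached[2:] + [9, 10]   # second token changed (e.g. a timestamp)

print(reusable_prefix_len(cached, stable))    # 8 -> only 2 new tokens to process
print(reusable_prefix_len(cached, unstable))  # 1 -> 9 of 10 tokens reprocessed
```

On a long agentic session the same effect means re-running prefill over tens of thousands of tokens on every turn.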

dmezzetti • yesterday at 6:15 PM

Local models are great for a lot of things past just software development. We need to move towards solving other real world problems vs just building software. I've been focused on that with TxtAI (https://github.com/neuml/txtai) for 6 years now.

cat_plus_plus • yesterday at 6:06 PM

Gemma4 31B with MTP enabled is faster and, I feel, a bit stronger at coding. Either one can run in 32GB of VRAM or unified RAM with some tuning (3-bit weights, 8-bit KV cache).

verdverm • yesterday at 6:02 PM

Qwen's new AgentWorld model is good too: https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B

I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark

ascii0eks84 • yesterday at 5:33 PM

Very capable lora adapters are surfacing but it seems they are very niche.

rvz • today at 5:10 AM

Reading the comments, it seems that in the AI race to zero, Apple was already at the finish line, as predicted.

So it will be no surprise when there comes a time where everyone is able to run a model, say GLM 5.2, locally on their machine. Like it or not.

m3kw9 • today at 3:10 AM

Hmm, I used it and it can't get past a simple coding test that 5.5 passes with light reasoning.

mikert89 • yesterday at 5:35 PM

none of these local models are good for development, complete waste of time. nobody has $100k+ hardware sitting around at home to actually run a good model

Go7hic • today at 7:48 AM

goat

rusk • yesterday at 5:25 PM

Spent a week trying to get sensible results out of llama 3.3. At one point it even simulated doing the work, log output and everything, and when I challenged it about the missing artefacts it actually started questioning my intelligence. Seems appropriate for a Zuck enterprise.

Qwen on the other hand got straight to work with astonishing competency on the same system.

From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.

ermantrout • today at 12:39 PM

[flagged]

john-frandsen • today at 3:22 PM

[flagged]

Nasser_CAD • today at 3:46 AM

[flagged]

cloudcanalx • today at 6:08 AM

[dead]
