Hacker News

gcr yesterday at 5:31 PM (3 replies)

There are two flavors of Qwen 3.6:

- A 27B "dense" model

- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.

For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have a 2021 M1 Max with 32GB of unified memory. With llama.cpp's default settings it processes prompts ("reads") at ~300-500 tokens/sec and generates ("writes") at ~30 tokens/sec, which is plenty fast. The 27B model reads at ~70 tokens/sec and writes at ~5 tokens/sec.

The 35B MoE model technically takes slightly more memory but is much faster, because each token activates only 3B of its parameters, roughly 1/9th the per-token work of the 27B dense model. It's not quite as "smart", but it's comparable.
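To make the "1/9th the work" figure concrete, here is a back-of-envelope sketch in Python. It uses only the parameter counts and throughput numbers quoted above. The variable names are just illustrative, and treating "work" as proportional to active parameters is a simplifying assumption.

```python
# Figures quoted in the comment above (M1 Max, 32GB, llama.cpp defaults).
ACTIVE_PARAMS_MOE_B = 3   # the 35B MoE activates ~3B parameters per token
PARAMS_DENSE_B = 27       # the dense model uses all 27B parameters per token

# Assumption: per-token work scales with the number of active parameters.
work_ratio = ACTIVE_PARAMS_MOE_B / PARAMS_DENSE_B

# Reported throughput, in tokens/sec.
moe_write, dense_write = 30, 5
moe_read_low, moe_read_high, dense_read = 300, 500, 70

write_speedup = moe_write / dense_write
read_speedup_low = moe_read_low / dense_read
read_speedup_high = moe_read_high / dense_read

print(f"per-token work ratio: 1/{1 / work_ratio:.0f}")
print(f"generation speedup: {write_speedup:.1f}x")
print(f"prompt speedup: {read_speedup_low:.1f}x to {read_speedup_high:.1f}x")
```

The reported speedups (about 6x for generation) come out somewhat below the 9x the parameter ratio alone would suggest, so active parameter count is presumably not the only cost involved.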


Replies

flockonus yesterday at 8:11 PM

For coding tasks the 27B model is reported to be much more effective, although you can probably only run 4-bit or 5-bit quants with this much memory.
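A rough way to see why 4-bit or 5-bit is about the limit on a 32GB machine: the sketch below estimates the size of the weights alone at a nominal number of bits per weight. `weight_size_gb` is a hypothetical helper, not part of any library. It ignores the KV cache, context buffers, and the OS's own memory use, and real GGUF quants typically average somewhat more than their nominal bit width, so treat the results as lower bounds.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    """
    return params_billions * bits_per_weight / 8


for bits in (4, 5, 8):
    print(f"27B at {bits} bits/weight: ~{weight_size_gb(27, bits):.1f} GB")
```

At 8 bits the weights alone would already need about 27 GB, which leaves very little of 32GB for anything else.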

Recommend https://www.reddit.com/r/LocalLLaMA/ as a great source for this type of discussion.

pixelesque yesterday at 6:24 PM

Thank you - I'll give that a go!

julianlam yesterday at 6:08 PM

May I ask why the M quant instead of the XL one?

Obviously bigger != better but I don't know what the differences are.
