
JSR_FDED • today at 12:08 PM • 3 replies

I assumed the 27B dense model would be preferable to a MoE model, and that it wouldn’t fit into a consumer graphics card, which leaves the Macs.

Then I assumed for cost and battery/heat reasons that a Mini would be better than a laptop.

Replies

SwellJoe • today at 6:46 PM

The current dense models from the Gemma 4 or Qwen 3.6 families will run well on a consumer GPU with 32GB in a 4-bit quantization (which is a little lossy for Qwen 3.6, not so much for Gemma 4, as it has a QAT 4-bit version). Even an Intel ARC B70 will work, though it's worth spending a little more for the AMD Radeon AI Pro 9700, as it'll be like 40% faster, I think. A dedicated GPU will be faster and cheaper than a Mac Mini. But nothing is a good deal right now; everything is overpriced (except DeepSeek tokens, which cost pennies to run a model that's better than anything you could self-host... DeepSeek V4 Flash, and even Pro, are absurdly cheap, made even cheaper by their bonkers cheap cached token pricing and uniquely effective caching).

mswphd • today at 4:26 PM

dense models are (more) compute heavy, so are generally worse to run on mac. mac tends to be better for (larger) MoE models.

27B dense can fit on a consumer graphics card. Even without getting into various "intrusive" ways to shrink the size of a model (e.g. REAP), something like a NVFP4 quant of Qwen3.6 27b

https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4

should fit within ~22GB of VRAM. So easily on a 5090. It would also fit on a 3090/4090, but iirc they don't have NVFP4 natively, so you would want a different quant for them.
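A rough sanity check on that fit claim, as a minimal sketch rather than a measurement. It assumes about 4.5 effective bits per weight for NVFP4 (4-bit values plus a shared 8-bit scale per 16-weight block) and a placeholder 6 GB of overhead for KV cache, activations, and runtime buffers. The helper names `weight_gb` and `fits` are made up for illustration, and real checkpoints often keep some layers at higher precision, so actual usage can be higher.

```python
# Back-of-the-envelope VRAM estimate for a quantized dense model.
# Every constant here is an illustrative assumption, not a measured value.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, in decimal GB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given VRAM."""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    PARAMS_B = 27        # 27B dense model, as discussed above
    BITS = 4.5           # assumed effective bits/weight for a 4-bit block format
    OVERHEAD_GB = 6.0    # assumed KV cache + activations + runtime buffers

    print(f"weights only: {weight_gb(PARAMS_B, BITS):.1f} GB")
    for name, vram in [("16GB card", 16), ("24GB card", 24), ("32GB card", 32)]:
        verdict = "fits" if fits(PARAMS_B, BITS, vram, OVERHEAD_GB) else "does not fit"
        print(f"{name}: {verdict}")
```

The overhead term is the part that varies most in practice: it grows with context length and batch size, so a long context can push a model that "fits" on paper past a 24GB card.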

you can see /r/LocalLLaMA for some discussions. See this (random) post about Qwen3.6-27B on a 3090 at ~100 tok/s

https://www.reddit.com/r/LocalLLaMA/comments/1ujo46r/qwen_36...

Note that it is possible you could still do this stuff with a mac, as there are ways of hooking up an eGPU to macs and using it for inference. My understanding is they're all fairly hacky though, so it would likely be preferable to just get a 3090 (or a non-nvidia option, e.g. an AMD r9700 pro, which has ~32GB of VRAM for much cheaper than a 5090).

https://www.reddit.com/r/LocalLLaMA/comments/1u50hnm/qwen_27...

that seems considerably slower though (~30 tok/s). I don't know if that's an outlier/misconfigured setup or what. In general there will be much better resources for local setups using 3090s, as they're quite popular. Note that 3090s (but not 4090s nor 5090s) have NVLink, so you can network the cards fairly effectively. For this reason 2x 3090 setups are fairly popular as well. I've heard that club 3090 makes that relatively straightforward

https://github.com/noonghunna/club-3090

but don't have experience myself.

blensor • today at 12:20 PM

The reason why I was curious is that I am running my stuff on a Strix Halo and I get the feeling that this class of devices (GMKtec, Minisforum, Lenovo, etc.) seems to be becoming a pretty good alternative.
