Hacker News

jychang (yesterday at 12:38 PM)

32GB of VRAM is more than enough for Qwen 3.5 35b.

You can just load the Q4_K_XL model like normal and put all tensors on the GPU, without any -ot or --cpu-moe flags.

If for some reason you need a massive context where model + KV cache won't fit in 32GB, then use -ot to move the FFN MoE expert tensors for 1-2 layers into RAM. You'll take a speed hit (those params get loaded from slower system RAM instead of fast VRAM), but it'll work.
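A sketch of what that looks like with llama.cpp's llama-server, assuming a Q4_K_XL GGUF. The model filename, context size, and which layers to offload are illustrative; the -ot regex targets the FFN expert tensors of layers 0 and 1 by name:

```shell
# Normal case: everything fits in 32GB VRAM, no offload flags needed.
# (model path is a placeholder)
llama-server -m model-Q4_K_XL.gguf -ngl 99

# Large-context case: keep everything on GPU except the FFN MoE expert
# tensors of layers 0-1, which -ot (--override-tensor) pins to CPU RAM.
llama-server -m model-Q4_K_XL.gguf -ngl 99 \
  -ot 'blk\.(0|1)\.ffn_.*_exps\.=CPU' \
  -c 131072
```

Offloading only a couple of layers' experts keeps the speed penalty small, since the rest of the model still runs entirely from VRAM.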


Replies

roxolotl (yesterday at 1:11 PM)

Nice, ok, I'll play with that. I'm mostly just learning what's possible. Qwen 3.5 35b has been great without any customization, but it's interesting to learn what the options are.