What method are you using to do that? I've been playing with llama.cpp a lot lately, trying to figure out the cleanest way to get a solid context window with 32 GB of VRAM and 64 GB of system RAM.
You can just load the Q4_K_XL model like normal and put all tensors on the GPU, with no -ot or --cpu-moe flags.
If you need a massive context for some reason, where model + KV cache won't fit in 32 GB, then use -ot to move the FFN MoE expert tensors for 1-2 layers into system RAM. You'll take a speed hit (those params get read from slower RAM instead of fast VRAM), but it'll work.
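For anyone who wants to see the two setups side by side, here's a rough sketch that just builds the command lines. Assumptions on my part: `model.gguf` is a placeholder filename, `build_llama_args` is a made-up helper, the context sizes are arbitrary, and the regex assumes the usual GGUF expert tensor names (`blk.N.ffn_up_exps.weight` and friends), so check your model's tensor names before relying on it.

```python
import shlex


def build_llama_args(model_path, ctx_size, offload_layers=()):
    """Build a llama-server argument list (hypothetical helper).

    offload_layers: indices of layers whose MoE expert FFN tensors
    should stay in system RAM instead of VRAM.
    """
    # -ngl 99 asks for all layers on the GPU; -c sets the context size.
    args = ["llama-server", "-m", model_path, "-ngl", "99", "-c", str(ctx_size)]
    if offload_layers:
        layers = "|".join(str(i) for i in offload_layers)
        # Assumed tensor naming: blk.<layer>.ffn_{gate,up,down}_exps.weight
        args += ["-ot", rf"blk\.({layers})\.ffn_.*_exps\.=CPU"]
    return args


# Everything on the GPU, no overrides.
print(shlex.join(build_llama_args("model.gguf", 32768)))

# Bigger context: keep the experts of layers 0 and 1 in system RAM.
print(shlex.join(build_llama_args("model.gguf", 131072, offload_layers=[0, 1])))
```

If the second form still doesn't fit, add more layer indices to the list one at a time; each one frees some VRAM and costs some speed.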
32 GB of VRAM is more than enough for Qwen 3.5 35B.