bigyabai • yesterday at 4:47 PM • 2 replies

The model has 80B parameters, but only 3B are activated during inference. I'm running the old 2507 Qwen3 30B model on my 8 GB NVIDIA card and get very usable performance.

Replies

jwr • today at 7:43 AM

I understand that, but whether it's usable depends on whether Ollama can load parts of it into memory on my Mac, and how quickly.

coolspot • yesterday at 7:47 PM

Yes, but you don’t know in advance which 3B parameters you will need, so you have to keep all 80B in VRAM, or wait until the correct 3B are loaded from NVMe->RAM->VRAM. And of course it could be a different 3B for each token.
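
To put rough numbers on that point, here is a back-of-the-envelope sketch in Python. The 80B total and 3B active figures come from the thread; the bit widths (16, 8, 4) are assumed example quantization levels, and the helper `weights_gb` is a hypothetical name. It counts weights only and ignores the KV cache, activations, and runtime overhead, so real usage would be higher.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


TOTAL_B = 80   # total parameters in billions, from the thread
ACTIVE_B = 3   # parameters activated per token in billions, from the thread

# Assumed example precisions; actual quantization formats add some overhead.
for bits in (16, 8, 4):
    total = weights_gb(TOTAL_B, bits)
    active = weights_gb(ACTIVE_B, bits)
    print(f"{bits:>2}-bit: all weights ~{total:.1f} GB, active per token ~{active:.1f} GB")
```

Even at 4 bits per parameter, the full set of weights comes to about 40 GB, while the part touched for any single token is about 1.5 GB. The small active set helps compute speed, but since the active subset can change from token to token, the whole 40 GB still has to sit somewhere fast enough to be fetched on demand.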
