Hacker News

zozbot234 · today at 7:21 PM

Most people doing local inference run the MoE layers on the CPU anyway: decode is not compute-bound, and wasting high-bandwidth VRAM on mostly idle expert weights is silly. That VRAM is better spent on longer context. Recent setups even offload the MoE experts to a fast NVMe drive (PCIe 5.0 x4 or similar): it's slow, but it opens up running even SOTA local MoE models on ordinary hardware.
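The reason this works can be seen with some back-of-the-envelope arithmetic: decode is memory-bandwidth bound, so tokens/s is roughly bandwidth divided by the bytes of active weights streamed per token. A minimal sketch, with purely illustrative numbers (the 4 GB active-weight figure and the bandwidth values are assumptions, not measurements):

```python
GB = 1e9

def decode_tps(active_weight_bytes, bandwidth_bytes_per_s):
    # Decode streams every active weight through the compute units once
    # per token, so throughput is approximately bandwidth / active bytes.
    return bandwidth_bytes_per_s / active_weight_bytes

# Assumed: a MoE model touching ~4 GB of weights per token (a modest
# active-parameter count at 4-bit quantization).
active = 4 * GB

print(decode_tps(active, 1000 * GB))  # GPU VRAM (~1 TB/s): 250 tok/s
print(decode_tps(active, 80 * GB))    # dual-channel DDR5 (~80 GB/s): 20 tok/s
print(decode_tps(active, 14 * GB))    # PCIe 5.0 x4 NVMe (~14 GB/s): 3.5 tok/s
```

The gap between tiers is large, but because a MoE only touches its active experts per token, even the NVMe tier stays in "slow but usable" territory rather than being off by the total-parameter ratio.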


Replies

jmward01 · today at 8:30 PM

I think you're making my point. Having a little slower, but a lot more, memory on the card would speed this use case up considerably: it would remove the need to go out to system memory, or make that extra capacity available for rarely used experts, allowing even larger MoE models to run with good performance.
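The two-tier idea can be sketched with the same kind of arithmetic: per-token time is hot weights over fast-tier bandwidth plus routed experts over slow-tier bandwidth. All numbers here are illustrative assumptions (the 2 GB splits, the ~64 GB/s PCIe 5.0 x16 figure, and the ~200 GB/s hypothetical on-card pool):

```python
GB = 1e9

def token_time_s(hot_bytes, hot_bw, expert_bytes, expert_bw):
    # Per-token decode time with two memory tiers: attention/shared
    # weights stream from the fast tier, routed experts from a larger,
    # slower tier. Bandwidths in bytes/s.
    return hot_bytes / hot_bw + expert_bytes / expert_bw

hot = 2 * GB       # assumed attention/shared weights read every token
experts = 2 * GB   # assumed routed-expert weights touched per token

# Experts fetched from system RAM over PCIe 5.0 x16 (~64 GB/s):
t_pcie = token_time_s(hot, 1000 * GB, experts, 64 * GB)
# Experts in a hypothetical slower-but-larger on-card pool (~200 GB/s):
t_oncard = token_time_s(hot, 1000 * GB, experts, 200 * GB)

print(1 / t_pcie)    # ~30 tok/s
print(1 / t_oncard)  # ~83 tok/s
```

Under these assumptions the slow tier dominates the per-token time either way, so even a modestly faster on-card pool moves overall throughput a lot, which is the crux of the argument.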
