Hacker News

zozbot234 · today at 7:21 PM

Most people doing local inference run the MoE layers on the CPU anyway: decode is not compute-bound, and wasting high-bandwidth VRAM on mostly idle expert weights is silly. That VRAM is better spent on longer context. Recent setups even offload the MoE experts to a fast NVMe drive (PCIe 5.0 x4 or similar): it's slow, but it opens up running even SOTA local MoE models on ordinary hardware.
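The reason this works can be seen with some back-of-the-envelope arithmetic: decode is memory-bandwidth bound, so tokens/s is roughly bandwidth divided by the bytes of active weights streamed per token. A minimal sketch, with purely illustrative numbers (the 4 GB active-weight figure and the bandwidth values are assumptions, not measurements):

```python
GB = 1e9

def decode_tps(active_weight_bytes, bandwidth_bytes_per_s):
    # Decode streams every active weight through the compute units once
    # per token, so throughput is approximately bandwidth / active bytes.
    return bandwidth_bytes_per_s / active_weight_bytes

# Assumed: a MoE model touching ~4 GB of weights per token (a modest
# active-parameter count at 4-bit quantization).
active = 4 * GB

print(decode_tps(active, 1000 * GB))  # GPU VRAM (~1 TB/s): 250 tok/s
print(decode_tps(active, 80 * GB))    # dual-channel DDR5 (~80 GB/s): 20 tok/s
print(decode_tps(active, 14 * GB))    # PCIe 5.0 x4 NVMe (~14 GB/s): 3.5 tok/s
```

The gap between tiers is large, but because a MoE only touches its active experts per token, even the NVMe tier stays in "slow but usable" territory rather than being off by the total-parameter ratio.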


Replies

jmward01 · today at 8:30 PM

I think you're making my point. Having a little slower, but a lot more, memory on the card would speed this use case up considerably: it would remove the need to go out to system memory, or make that extra capacity available for rarely used experts, allowing even larger MoE models to run with good performance.
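The two-tier idea can be sketched with the same kind of arithmetic: per-token time is hot weights over fast-tier bandwidth plus routed experts over slow-tier bandwidth. All numbers here are illustrative assumptions (the 2 GB splits, the ~64 GB/s PCIe 5.0 x16 figure, and the ~200 GB/s hypothetical on-card pool):

```python
GB = 1e9

def token_time_s(hot_bytes, hot_bw, expert_bytes, expert_bw):
    # Per-token decode time with two memory tiers: attention/shared
    # weights stream from the fast tier, routed experts from a larger,
    # slower tier. Bandwidths in bytes/s.
    return hot_bytes / hot_bw + expert_bytes / expert_bw

hot = 2 * GB       # assumed attention/shared weights read every token
experts = 2 * GB   # assumed routed-expert weights touched per token

# Experts fetched from system RAM over PCIe 5.0 x16 (~64 GB/s):
t_pcie = token_time_s(hot, 1000 * GB, experts, 64 * GB)
# Experts in a hypothetical slower-but-larger on-card pool (~200 GB/s):
t_oncard = token_time_s(hot, 1000 * GB, experts, 200 * GB)

print(1 / t_pcie)    # ~30 tok/s
print(1 / t_oncard)  # ~83 tok/s
```

Under these assumptions the slow tier dominates the per-token time either way, so even a modestly faster on-card pool moves overall throughput a lot, which is the crux of the argument.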
