Hacker News

jmward01 today at 6:40 PM

I think this shows a shift in model architecture. MoE and similar designs need more memory relative to the available compute than a single big dense model with lots of layers and weights. I think this trend is likely to accelerate: once the trade-off is built in, it encourages even more experts, which pushes the trade-off further, so more experts...
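To put rough numbers on the memory-versus-compute point, here is a minimal sketch. It assumes the usual approximation of about 2 FLOPs per active parameter per token, and the model sizes are made up for illustration, not taken from any real model. The names `ModelShape` and `bytes_per_flop` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    total_params: float   # every weight that has to be stored somewhere
    active_params: float  # weights actually touched for one token


def weight_bytes(m: ModelShape, bytes_per_param: float = 2.0) -> float:
    """Memory needed to hold all weights (2 bytes/param assumes 16-bit weights)."""
    return m.total_params * bytes_per_param


def flops_per_token(m: ModelShape) -> float:
    """Approximate forward-pass cost: one multiply and one add per active weight."""
    return 2.0 * m.active_params


def bytes_per_flop(m: ModelShape, bytes_per_param: float = 2.0) -> float:
    """Stored bytes per FLOP of per-token compute; higher means more memory-heavy."""
    return weight_bytes(m, bytes_per_param) / flops_per_token(m)


# Illustrative shapes only (assumptions, not real models):
dense = ModelShape(total_params=30e9, active_params=30e9)
moe = ModelShape(total_params=240e9, active_params=30e9)

if __name__ == "__main__":
    for name, shape in (("dense", dense), ("moe", moe)):
        print(name, weight_bytes(shape) / 1e9, "GB,",
              bytes_per_flop(shape), "bytes/FLOP")
```

Under these assumptions both models cost the same compute per token, but the MoE one has to store eight times the weights, which is the kind of ratio that pulls hardware toward more memory per unit of compute.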


Replies

zozbot234 today at 7:21 PM

Most people doing local inference run the MoE layers on CPU anyway, because decode is not compute-constrained and wasting high-bandwidth VRAM on unused weights is silly. It's better to spend that VRAM on longer context. Recent architectures even offload the MoE experts to fast NVMe (PCIe 5.0 x4 or similar performance): it's slow, but it opens up running even SOTA local MoE models on ordinary hardware.
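A back-of-the-envelope way to see why decode speed tracks bandwidth rather than compute: if every generated token has to stream its active expert weights from wherever they live, tokens per second is capped at bandwidth divided by bytes read per token. The sketch below assumes no caching or overlap, and the bandwidth figures are rough theoretical peaks chosen for illustration (the VRAM figure in particular is a placeholder), not measurements. `decode_tokens_per_second` and `TIERS` are hypothetical names.

```python
def decode_tokens_per_second(bytes_read_per_token: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when weight streaming is the bottleneck."""
    if bytes_read_per_token <= 0:
        raise ValueError("bytes_read_per_token must be positive")
    return bandwidth_bytes_per_s / bytes_read_per_token


# Rough peak bandwidths in bytes/s (assumptions for illustration only):
TIERS = {
    "GPU VRAM (placeholder figure)": 1000e9,
    "dual-channel DDR5-5600": 5600e6 * 8 * 2,  # transfers/s * 8 bytes * 2 channels
    "NVMe on PCIe 5.0 x4": 15.75e9,            # approx. theoretical link rate
}

if __name__ == "__main__":
    # Hypothetical: 10 GB of expert weights read per token,
    # e.g. 20B active expert parameters at 4 bits each.
    bytes_per_token = 10e9
    for tier, bandwidth in TIERS.items():
        rate = decode_tokens_per_second(bytes_per_token, bandwidth)
        print(f"{tier}: at most {rate:.2f} tokens/s")
```

Real setups do better than the NVMe bound when frequently used experts stay cached in RAM, so treat these as ceilings for the uncached case only.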
