Hacker News

EnPissant · yesterday at 9:12 AM

MoE models need just as much VRAM as a dense model of the same total parameter count, because every token may use a different set of experts. They just run faster.
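
To put rough numbers on that, a back-of-envelope sketch in Python (the figures are the commonly cited ones for Mixtral 8x7B at fp16, used here purely as an illustration; the comment doesn't name a model):

    # Weights-only VRAM estimate for an MoE vs. its active parameters.
    # Figures are approximate, for Mixtral 8x7B at fp16.
    total_params = 46.7e9    # all experts must be resident in VRAM
    active_params = 12.9e9   # ~2 of 8 experts actually run per token
    bytes_per_param = 2      # fp16 / bf16

    print(f"weights in VRAM: ~{total_params * bytes_per_param / 1e9:.0f} GB")  # ~93 GB
    print(f"compute scales with ~{active_params / 1e9:.0f}B params per token")

So the memory footprint matches the full parameter count, while per-token compute (and hence speed) tracks only the active experts.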


Replies

regularfry · yesterday at 9:32 AM

This isn't quite right: it'll run with the full model loaded into system RAM, swapping experts onto the GPU as it needs them. It has turned out in the past that the active experts can be stable across more than one token, so you're not swapping as much as you'd think. I don't know whether that's been confirmed to still hold for recent MoEs, but I wouldn't be surprised.
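
A minimal sketch of what that swapping could look like, assuming a PyTorch-style setup (the class name, cache policy, and cache size below are made up for illustration; this is not how any particular runtime is confirmed to work):

    import copy
    import torch.nn as nn

    class OffloadedExperts:
        """Keep every expert in CPU RAM; copy one to the GPU only when
        the router selects it. The small cache exploits the observation
        above: the same experts often stay active across consecutive
        tokens, so hits are common and swaps are rare."""

        def __init__(self, experts, device="cuda", cache_size=2):
            self.cpu_experts = list(experts)  # one nn.Module per expert, on CPU
            self.device = device
            self.cache = {}                   # expert index -> GPU-resident copy
            self.cache_size = cache_size

        def get(self, idx: int) -> nn.Module:
            if idx not in self.cache:                        # miss: swap in
                if len(self.cache) >= self.cache_size:
                    self.cache.pop(next(iter(self.cache)))   # evict oldest (FIFO)
                self.cache[idx] = copy.deepcopy(self.cpu_experts[idx]).to(self.device)
            return self.cache[idx]

If the router's picks really are stable across tokens, most calls hit the cache and only the occasional expert transfer has to cross the PCIe bus.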
