Hacker News

EnPissant · yesterday at 9:12 AM

MoE models need just as much VRAM as a dense model of the same total parameter count, because every token may use a different set of experts. They just run faster.
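
To put rough numbers on that, a back-of-envelope sketch in Python (the figures are the commonly cited ones for Mixtral 8x7B at fp16, used here purely as an illustration; the comment doesn't name a model):

    # Weights-only VRAM estimate for an MoE vs. its active parameters.
    # Figures are approximate, for Mixtral 8x7B at fp16.
    total_params = 46.7e9    # all experts must be resident in VRAM
    active_params = 12.9e9   # ~2 of 8 experts actually run per token
    bytes_per_param = 2      # fp16 / bf16

    print(f"weights in VRAM: ~{total_params * bytes_per_param / 1e9:.0f} GB")  # ~93 GB
    print(f"compute scales with ~{active_params / 1e9:.0f}B params per token")

So the memory footprint matches the full parameter count, while per-token compute (and hence speed) tracks only the active experts.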


Replies

regularfry · yesterday at 9:32 AM

This isn't quite right: it'll run with the full model loaded into system RAM, swapping experts onto the GPU as it needs them. It has turned out in the past that the active experts can be stable across more than one token, so you're not swapping as much as you'd think. I don't know whether that's been confirmed to still hold for recent MoEs, but I wouldn't be surprised.
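
A minimal sketch of what that swapping could look like, assuming a PyTorch-style setup (the class name, cache policy, and cache size below are made up for illustration; this is not how any particular runtime is confirmed to work):

    import copy
    import torch.nn as nn

    class OffloadedExperts:
        """Keep every expert in CPU RAM; copy one to the GPU only when
        the router selects it. The small cache exploits the observation
        above: the same experts often stay active across consecutive
        tokens, so hits are common and swaps are rare."""

        def __init__(self, experts, device="cuda", cache_size=2):
            self.cpu_experts = list(experts)  # one nn.Module per expert, on CPU
            self.device = device
            self.cache = {}                   # expert index -> GPU-resident copy
            self.cache_size = cache_size

        def get(self, idx: int) -> nn.Module:
            if idx not in self.cache:                        # miss: swap in
                if len(self.cache) >= self.cache_size:
                    self.cache.pop(next(iter(self.cache)))   # evict oldest (FIFO)
                self.cache[idx] = copy.deepcopy(self.cpu_experts[idx]).to(self.device)
            return self.cache[idx]

If the router's picks really are stable across tokens, most calls hit the cache and only the occasional expert transfer has to cross the PCIe bus.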
