Hacker News

EnPissant, yesterday at 10:32 PM

The ability to stream weights from disk has nothing to do with whether the model is MoE. You can always do this. It will be unusable either way.


Replies

zozbot234, yesterday at 10:55 PM

Agreed, but for a dense model you'd have to stream the whole model for every token, whereas with MoE there's at least the possibility that some experts may be "cold" for any given request and never need to be streamed in or cached. This will probably become more likely as models get even sparser. (The "it's unusable" judgment is correct if you're considering close-to-minimum requirements, but for just getting a model to fit, caching "almost all of it" in RAM may be an excellent choice.)
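
To put rough numbers on the dense vs. MoE difference, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names, the model sizes, the disk bandwidth, and the simplification that a fixed fraction of the weights touched per token is already sitting in RAM. It ignores compute time, prefetching, and cache thrashing, so treat it as an upper bound on throughput, not a benchmark.

```python
def bytes_streamed_per_token(active_bytes: float, cached_fraction: float) -> float:
    """Bytes read from disk for one token.

    active_bytes: weights touched per token (the whole model if dense,
                  only the routed experts plus shared layers if MoE).
    cached_fraction: assumed share of those weights already in RAM (0.0 to 1.0).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return active_bytes * (1.0 - cached_fraction)


def max_tokens_per_second(active_bytes: float, cached_fraction: float,
                          disk_bandwidth: float) -> float:
    """Upper bound on tokens/s when disk reads are the only bottleneck."""
    streamed = bytes_streamed_per_token(active_bytes, cached_fraction)
    if streamed == 0:
        return float("inf")  # everything needed is already in RAM
    return disk_bandwidth / streamed


# Hypothetical numbers, not measurements of any real model or drive.
GB = 1e9
DISK_BANDWIDTH = 5 * GB   # assumed sequential read speed, bytes/s
DENSE_ACTIVE = 100 * GB   # dense: all 100 GB is touched for every token
MOE_ACTIVE = 10 * GB      # MoE: same total size, but only 10 GB active per token
CACHED = 0.9              # assume 90% of the touched weights are found in RAM

dense_rate = max_tokens_per_second(DENSE_ACTIVE, CACHED, DISK_BANDWIDTH)
moe_rate = max_tokens_per_second(MOE_ACTIVE, CACHED, DISK_BANDWIDTH)
print(f"dense: {dense_rate:.2f} tok/s, MoE: {moe_rate:.2f} tok/s")
```

With these made-up inputs the dense case comes out around half a token per second and the MoE case around five, which is the gap the comment is pointing at: same total size, same cache share, but far fewer bytes have to come off disk per token. If the routed experts for a request happen to be the "hot" ones already cached, the effective cached fraction for MoE would be higher still.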
