logoalt Hacker News

jmward01yesterday at 8:30 PM1 replyview on HN

I think you are making my point. Having a little slower, but a lot more, memory on the card would speed this use-case up a lot and remove the need to go to system memory or make it available for very rarely used experts allowing for even larger MOE models running with good performance.


Replies

zozbot234yesterday at 8:48 PM

I think speeding up long context and opening up the use of models with larger shared layers is ultimately more relevant than hosting unused MoE layers. Of course you could do that as a last resort, i.e. when running with a smaller context that leaves some VRAM free to use.

show 1 reply