Hacker News

vessenes, yesterday at 11:50 AM (2 replies)

I think you mean inference compute? I believe all expert weights are updated in each backward pass during MoE training. The first benefit was a sort of structured pruning of weights through expert selection, so that a given token didn’t need to pass through ‘unnecessary’ parts of the model. This in turn let inference use memory more efficiently in memory-constrained environments, where cold or rarely used experts could be kept in slow RAM, or sometimes even streamed from storage.

But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!


Replies

agunapal, today at 10:21 AM

Here is a paper from a few years ago (Switch Transformers) that reports up to a 7x increase in pre-training speed with the same computational resources, which equates to savings.

https://arxiv.org/abs/2101.03961

bjourne, yesterday at 12:26 PM

Each token is only routed through a few chosen (top-k) experts during training, so not all expert weights are updated in the backward pass. OTOH, you may need more training to ensure all experts see enough tokens!

I doubt MoE is actually worth it, given how complicated high-performance expert routing and training are. But who knows, I don't.
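
To make the top-k point concrete, here is a minimal sketch in plain Python, not taken from any paper or framework. It assumes a toy scalar MoE in which expert e computes expert_w[e] * x, a linear router, and softmax gates over the selected experts only; all names (moe_forward_backward, router_w, expert_w) are hypothetical. It shows that, for a single token, experts outside the top-k get exactly zero gradient.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a short list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(scores, k):
    # Indices of the k largest scores, highest first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def moe_forward_backward(x, router_w, expert_w, k):
    """Toy scalar MoE (an assumption for illustration only).

    Returns the output y, the gradient dy/d(expert_w[e]) for every
    expert, and the list of experts chosen for this token.
    Router gradients are ignored to keep the sketch short.
    """
    logits = [r * x for r in router_w]           # one routing score per expert
    chosen = top_k_indices(logits, k)            # only these experts run
    gates = softmax([logits[i] for i in chosen])
    y = sum(g * expert_w[i] * x for g, i in zip(gates, chosen))

    grads = [0.0] * len(expert_w)                # unselected experts stay at 0
    for g, i in zip(gates, chosen):
        grads[i] = g * x                         # dy / d(expert_w[i])
    return y, grads, chosen

# One token, four experts, top-2 routing: experts 3 and 1 are chosen.
y, grads, chosen = moe_forward_backward(
    2.0, [0.1, 0.5, -0.3, 0.9], [1.0, 2.0, 3.0, 4.0], k=2
)
print(chosen, grads)
```

Across a whole batch, different tokens pick different experts, so most experts usually do get some gradient per step, just from fewer tokens each. In this toy, a token with x = -1.0 would be routed to experts 2 and 0 instead. Real implementations add load-balancing losses and capacity limits that this sketch leaves out.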