Hacker News

FuckButtons · yesterday at 7:00 PM

Not entirely true: it's random access within the relevant subset of experts, and since concepts are clustered, you actually have a much higher probability of repeatedly accessing the same subset of experts.


Replies

vlovich123 · today at 1:26 AM

It's called mixture of experts, but that doesn't mean concepts map cleanly, or even roughly, onto different experts. Otherwise you wouldn't select a new set of experts on nearly every token. You have to remember these architectures were designed to improve throughput in cloud deployments, where different GPUs each load an expert. There you precisely want tokens routed to experts roughly uniformly at random, to improve your GPU utilization rate. I have not heard of anyone training local MoE models to aid sharding.
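To make the routing concrete, here's a minimal sketch of per-token top-k gating, the mechanism both comments are arguing about. The dimensions and router weights are toy values I made up, not any specific model's: the point is just that each token's hidden state is routed independently, so the set of active experts can change from token to token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, purely illustrative.
n_tokens, d_model, n_experts, top_k = 4, 8, 16, 2

hidden = rng.standard_normal((n_tokens, d_model))      # token hidden states
router_w = rng.standard_normal((d_model, n_experts))   # learned router weights

# Router scores each token against every expert, then each token
# independently keeps its top-k experts.
logits = hidden @ router_w                             # (n_tokens, n_experts)
topk_idx = np.argsort(logits, axis=1)[:, -top_k:]      # (n_tokens, top_k)

for t, experts in enumerate(topk_idx):
    print(f"token {t}: experts {sorted(experts.tolist())}")
```

Whether consecutive tokens hit overlapping expert subsets (the locality claim) or near-uniformly spread ones (the load-balancing claim) depends entirely on what the router weights learned, which is exactly the disagreement above.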
