Hacker News

zozbot234, last Saturday at 11:37 PM

Normally, experts are picked at every layer, not just once per token. But there are plausible ways of getting around that bottleneck while streaming if you can batch many inferences together. Still, the Apple approach of swapping the experts only rarely is interesting, though it likely degrades the model a lot.
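To make the per-layer point concrete, here is a rough sketch, not any real model's router. Everything in it is hypothetical: the function names, the layer and expert counts, and above all the random choice standing in for a learned gating network. It only counts how many distinct (layer, expert) weight blocks would have to be streamed in, versus how many times those blocks get used.

```python
import random

def route(num_layers, num_experts, top_k, rng):
    """Pick top_k experts independently at each layer for one token.

    Random sampling is a stand-in (assumption) for a learned router.
    """
    return [sorted(rng.sample(range(num_experts), top_k))
            for _ in range(num_layers)]

def expert_loads(batch_routes):
    """Return (distinct (layer, expert) blocks needed, total token-expert uses)."""
    needed = set()
    uses = 0
    for token_route in batch_routes:
        for layer, experts in enumerate(token_route):
            for expert in experts:
                needed.add((layer, expert))
                uses += 1
    return len(needed), uses

# Hypothetical toy sizes, chosen only for illustration.
NUM_LAYERS, NUM_EXPERTS, TOP_K = 4, 8, 2
rng = random.Random(0)

single = [route(NUM_LAYERS, NUM_EXPERTS, TOP_K, rng)]
batch = [route(NUM_LAYERS, NUM_EXPERTS, TOP_K, rng) for _ in range(16)]

single_loads, single_uses = expert_loads(single)
batch_loads, batch_uses = expert_loads(batch)

print("single token: blocks loaded", single_loads, "uses", single_uses)
print("batch of 16:  blocks loaded", batch_loads, "uses", batch_uses)
```

For one token every block that gets loaded is used exactly once, so streaming cost is paid in full. For a batch, the number of blocks loaded is capped at layers times experts while the uses keep growing with batch size, which is the amortization the batching argument relies on.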


Replies

FridgeSeal, yesterday at 1:07 AM

Just get the bigger models to figure out the architecture required for hot-swappable sub-experts without loss of performance!

Got all those tokens, isn’t that the point of auto research and friends??

(Only sort of joking).