zozbot234 • last Saturday at 11:37 PM

Normally, experts are picked for every layer, not just every token. But there are plausible ways of getting around that bottleneck while streaming if you can batch many inferences together. Still, the Apple approach of swapping the experts only rarely is interesting, though it likely degrades the model a lot.
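
To make the per-layer point concrete, here is a toy sketch in plain Python. It is not any real model's router: the random scores stand in for a learned gating network, and the names `route`, `experts_to_load`, and `loads_per_token` are hypothetical. It only illustrates that every token picks its top-k experts again at every layer, and that batching lets many tokens share each expert that gets streamed in.

```python
import random


def route(num_layers, num_experts, top_k, num_tokens, seed=0):
    """Return routes[layer][token] = sorted expert ids picked by a toy router."""
    rng = random.Random(seed)
    routes = []
    for _ in range(num_layers):
        layer_routes = []
        for _ in range(num_tokens):
            # Assumption: random scores replace a learned gating network.
            scores = [rng.random() for _ in range(num_experts)]
            ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
            layer_routes.append(sorted(ranked[:top_k]))
        routes.append(layer_routes)
    return routes


def experts_to_load(routes):
    """Per layer, the distinct experts that must be resident to serve the batch."""
    return [set().union(*layer) for layer in routes]


def loads_per_token(routes):
    """Total expert loads across all layers, amortized over the tokens in the batch."""
    num_tokens = len(routes[0])
    return sum(len(needed) for needed in experts_to_load(routes)) / num_tokens


if __name__ == "__main__":
    single = route(num_layers=4, num_experts=16, top_k=2, num_tokens=1)
    batch = route(num_layers=4, num_experts=16, top_k=2, num_tokens=64)
    print("loads per token, batch of 1: ", loads_per_token(single))
    print("loads per token, batch of 64:", loads_per_token(batch))
```

With a single token, every layer needs its own top-k experts loaded, so the cost is `num_layers * top_k` loads for that one token. With a batch, each layer loads at most `num_experts` experts no matter how many tokens share them, which is the amortization the comment alludes to. Real routers are far from uniform, so actual numbers would differ.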

Replies

FridgeSeal • yesterday at 1:07 AM

Just get the bigger models to figure out the architecture required for hot-swappable sub-experts without loss of performance!

Got all those tokens, isn’t that the point of auto research and friends??

(Only sort of joking).
