jnovek • last Monday at 8:06 PM • 1 reply

I think that’s only true for MoE models. A dense model like 3.6 27b will require more memory (plus a KV cache).
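
For a rough sense of scale, here is a back-of-the-envelope sketch of the two memory terms mentioned above: the weights and the attention key/value cache. The layer count, KV head count, head dimension, context length, and 4-bit quantization are hypothetical placeholders, not the real configuration of any particular model, and the formula assumes standard (grouped-query) attention with one key and one value tensor per layer.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate memory for the weights alone (ignores quantization overhead)."""
    return num_params * bits_per_weight // 8


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """Approximate key/value cache size for a single sequence.

    The leading factor of 2 covers one key tensor and one value tensor
    per layer; bytes_per_value=2 assumes fp16/bf16 cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    # Hypothetical 27B-parameter dense model, 4-bit weights (assumed values).
    weights = weight_bytes(27_000_000_000, bits_per_weight=4)
    # Hypothetical attention shape and context length (assumed values).
    cache = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128,
                           context_tokens=32_768)
    print(f"weights:  {weights / 2**30:.1f} GiB")
    print(f"KV cache: {cache / 2**30:.1f} GiB")
```

The cache term grows linearly with context length, so at long contexts it can be a large fraction of the total.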

Replies

No, even MoE models need to fit into (V)RAM. MoE has faster inference because only a subset of experts is used to predict the next token, but the set of experts used changes with every token, so all of the weights have to stay loaded.
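
A toy sketch of that point: each token touches only a few experts, but over many tokens essentially every expert gets used. The routing here is uniformly random, which is an assumption for illustration only (real routers are learned), and the expert count and top-k value are made-up numbers.

```python
import random


def simulate_routing(num_experts, top_k, num_tokens, seed=0):
    """Pick top_k distinct experts per token at random (toy stand-in for a router).

    Returns the per-token expert sets and the union of all experts touched.
    """
    rng = random.Random(seed)
    per_token = [
        frozenset(rng.sample(range(num_experts), top_k))
        for _ in range(num_tokens)
    ]
    touched = set().union(*per_token)
    return per_token, touched


if __name__ == "__main__":
    # Hypothetical MoE layer: 8 experts, 2 active per token.
    per_token, touched = simulate_routing(num_experts=8, top_k=2, num_tokens=200)
    print("experts used by first 3 tokens:", [sorted(s) for s in per_token[:3]])
    print("experts touched over all tokens:", len(touched), "of 8")
```

Compute per token scales with the active experts, while the memory footprint scales with all of them, since any expert may be needed for the next token.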
