You don't even need system RAM for the inactive experts, they can simply reside on disk ...

zozbot234 • yesterday at 9:19 PM

You don't even need system RAM for the inactive experts; they can simply reside on disk and be accessed via mmap. The main remaining constraints these days are the dense layers, plus the context size because of the KV cache. The KV cache has very sparse writes, so it can be offloaded to swap.
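
To make the mmap point concrete, here is a rough sketch of the idea, not how any particular inference engine does it. The file layout, the constants, and the function names (`write_toy_experts`, `sum_active_experts`) are all made up for illustration. The file holds every expert's weights back to back, but only the byte ranges of the experts that are actually selected get read, so the OS only needs to page in those regions.

```python
import mmap
import os
import tempfile
from array import array

# Hypothetical toy layout: NUM_EXPERTS blocks of float32 weights, stored back to back.
NUM_EXPERTS = 8
EXPERT_FLOATS = 1024
EXPERT_BYTES = EXPERT_FLOATS * 4  # float32 is 4 bytes


def write_toy_experts(path):
    """Write a toy weight file in which expert i is filled with the value float(i)."""
    with open(path, "wb") as f:
        for i in range(NUM_EXPERTS):
            array("f", [float(i)] * EXPERT_FLOATS).tofile(f)


def sum_active_experts(path, active_ids):
    """Map the whole file, but read only the byte ranges of the active experts."""
    totals = {}
    with open(path, "rb") as f:
        # Mapping the file does not load it; pages are faulted in when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for expert_id in active_ids:
                start = expert_id * EXPERT_BYTES
                weights = array("f")
                # Slicing the mapping copies just this expert's bytes out of it.
                weights.frombytes(mm[start:start + EXPERT_BYTES])
                totals[expert_id] = sum(weights)
    return totals


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "experts.bin")
        write_toy_experts(path)
        # Pretend the router picked experts 2 and 5 for this token.
        print(sum_active_experts(path, [2, 5]))
```

Pages that have been touched still sit in the OS page cache, so "no system RAM" here really means the kernel decides what stays resident and can evict cold experts under memory pressure.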

Replies

nl • yesterday at 10:52 PM

Are there any benchmarks (or even vibes!) on the tokens/second one can expect with this strategy?
