bigyabai • yesterday at 5:37 PM • 1 reply

A 5T MoE model is still bottlenecked by streaming weights from SSD, in addition to compute bottlenecks during prefill and decode.

Replies

zozbot234 • yesterday at 9:24 PM

True, but a cluster built on pipeline parallelism can naturally stream from multiple SSDs in parallel. That probably makes offload somewhat more effective. You also have RAM caching available as a natural possibility.
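
A rough back-of-envelope sketch of that point, in Python. Everything here is an assumption for illustration: the function name, the idea that each pipeline stage reads from its own SSD with reads fully overlapping, and all of the numbers (100 GB of active weights touched per token, 5 GB/s per SSD). It ignores compute and interconnect entirely, so treat the result as an optimistic upper bound rather than a prediction.

```python
def decode_tokens_per_second(active_bytes_per_token: float,
                             ssd_bandwidth_bytes_per_s: float,
                             num_stages: int,
                             cache_hit_rate: float) -> float:
    """Upper bound on decode speed when SSD streaming is the only limit.

    Assumes (hypothetically) one SSD per pipeline stage, perfectly
    overlapped reads, and a RAM cache that serves a fixed fraction of
    the weight bytes needed for each token.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")

    # Bytes that miss the RAM cache and must come from SSD.
    bytes_from_ssd = active_bytes_per_token * (1.0 - cache_hit_rate)
    if bytes_from_ssd == 0.0:
        return float("inf")  # everything is served from RAM

    # Each stage streams from its own SSD, so bandwidth adds up.
    aggregate_bandwidth = ssd_bandwidth_bytes_per_s * num_stages
    return aggregate_bandwidth / bytes_from_ssd


if __name__ == "__main__":
    # Illustrative numbers only, not measurements of any real model.
    for stages, hit_rate in [(1, 0.0), (8, 0.0), (8, 0.5)]:
        rate = decode_tokens_per_second(100e9, 5e9, stages, hit_rate)
        print(f"stages={stages} hit_rate={hit_rate}: {rate:.2f} tokens/s")
```

Under these assumptions the bound scales linearly with the number of stages and with 1 / (1 - hit rate), which is the sense in which parallel SSDs and RAM caching both help. It does not remove the bottleneck the parent comment describes.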
