Correct. You want everything loaded, but on each forward pass only a few of the experts are activated, so the computation is less than in a dense model of the same parameter count.
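To make that concrete, here's a minimal sketch of top-k expert routing in PyTorch. Everything here (`TinyMoE`, the sizes, the router) is made up for illustration, not any particular model's implementation; the point is just that all experts sit in memory while each token's compute only touches `top_k` of them.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # All num_experts experts stay resident in memory, but each expert
        # only runs on the subset of tokens that were routed to it.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

x = torch.randn(16, 64)
print(TinyMoE()(x).shape)  # torch.Size([16, 64])
```

With `top_k=2` out of 8 experts, each token pays for roughly a quarter of the expert FLOPs a dense model of the same size would, while the memory footprint is unchanged.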
That being said, there are libraries that can load a model layer by layer (say, from an SSD) and technically perform inference with ~8 GB of RAM, but it'd be painfully slow.
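Roughly, those libraries work like the sketch below: stream one layer at a time from disk, run it, free it, and move on. This is a simplified assumption of the approach, not any specific library's API; `layer_files` is a hypothetical list of per-layer checkpoint paths. The SSD read per layer on every forward pass is exactly why it's so slow.

```python
import torch

@torch.no_grad()
def layered_forward(x, layer_files, device="cpu"):
    """Run inference with only one layer resident in RAM at a time.

    layer_files: hypothetical list of paths, each holding one serialized
    nn.Module (e.g. saved with torch.save). Illustrative only.
    """
    for path in layer_files:
        # Pull a single layer off disk (weights_only=False because we load
        # a pickled module here, purely for the sake of the sketch).
        layer = torch.load(path, map_location=device, weights_only=False)
        x = layer(x)
        del layer  # free the layer before loading the next one
    return x
```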