Correct. You want everything loaded, but on each forward pass only a few of the experts are activated, so the computation is less than in a dense model of the same parameter count.
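To make that concrete, here's a minimal sketch of top-k expert routing in PyTorch. Everything here (`TinyMoE`, the sizes, the router) is made up for illustration, not any particular model's implementation; the point is just that all experts sit in memory while each token's compute only touches `top_k` of them.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # All num_experts experts stay resident in memory, but each expert
        # only runs on the subset of tokens that were routed to it.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

x = torch.randn(16, 64)
print(TinyMoE()(x).shape)  # torch.Size([16, 64])
```

With `top_k=2` out of 8 experts, each token pays for roughly a quarter of the expert FLOPs a dense model of the same size would, while the memory footprint is unchanged.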
That being said, there are libraries that can load a model layer by layer (say, from an SSD) and technically perform inference with ~8 GB of RAM, but it'd be painfully slow.
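Roughly, those libraries work like the sketch below: stream one layer at a time from disk, run it, free it, and move on. This is a simplified assumption of the approach, not any specific library's API; `layer_files` is a hypothetical list of per-layer checkpoint paths. The SSD read per layer on every forward pass is exactly why it's so slow.

```python
import torch

@torch.no_grad()
def layered_forward(x, layer_files, device="cpu"):
    """Run inference with only one layer resident in RAM at a time.

    layer_files: hypothetical list of paths, each holding one serialized
    nn.Module (e.g. saved with torch.save). Illustrative only.
    """
    for path in layer_files:
        # Pull a single layer off disk (weights_only=False because we load
        # a pickled module here, purely for the sake of the sketch).
        layer = torch.load(path, map_location=device, weights_only=False)
        x = layer(x)
        del layer  # free the layer before loading the next one
    return x
```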