Yes, but you don’t know which 3B parameters you will need, so you either have to keep all 80B in VRAM or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it could be a different 3B for each token.
The latest SSDs benchmark at 3 GB/s and up. The marginal latency would be trivial compared to the inference time.
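For anyone who wants to plug in their own numbers, here is a rough back-of-envelope sketch of the transfer time using the figures from this thread (3B active parameters, 3 GB/s). The function name, the bytes-per-parameter values, and the worst-case assumption that every token needs a completely fresh 3B with nothing cached in RAM or VRAM are illustrative assumptions, not measurements.

```python
def worst_case_load_seconds(active_params: float,
                            bytes_per_param: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Time to stream one token's active parameters from storage.

    Assumes (hypothetically) that none of the needed weights are
    already cached, and that the drive sustains its benchmark bandwidth.
    """
    return active_params * bytes_per_param / bandwidth_bytes_per_s


if __name__ == "__main__":
    ACTIVE_PARAMS = 3e9      # 3B active parameters per token (from the thread)
    SSD_BANDWIDTH = 3e9      # 3 GB/s read speed (from the thread)

    # Bytes per parameter are assumptions: 2 for FP16, 0.5 for 4-bit quantization.
    for label, bytes_per_param in [("FP16", 2.0), ("4-bit", 0.5)]:
        seconds = worst_case_load_seconds(ACTIVE_PARAMS, bytes_per_param, SSD_BANDWIDTH)
        print(f"{label}: {seconds:.2f} s per token")
```

How close real workloads come to this worst case depends on how much the selected parameters overlap from one token to the next and how much is already sitting in RAM, which the sketch does not model.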