coolspot • yesterday at 7:47 PM

Yes, but you don’t know which 3B parameters you will need, so you either have to keep all 80B in VRAM or wait until the correct 3B are loaded from NVMe->RAM->VRAM. And of course it could be a different 3B for each token.

Replies

drozycki • yesterday at 8:10 PM

The latest SSDs benchmark at 3 GB/s and up. The marginal latency would be trivial compared to the inference time.
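
As a rough back-of-the-envelope check on the numbers in this thread, the sketch below estimates the worst-case time to stream 3B parameters from an SSD at 3 GB/s. The bytes-per-parameter values and the assumption that none of the needed weights are already resident in RAM or VRAM are illustrative and not stated by either commenter; `load_seconds` is a hypothetical helper.

```python
def load_seconds(params: float, bytes_per_param: float, bandwidth_bytes_per_s: float) -> float:
    """Time to stream `params` weights at a given sustained read bandwidth."""
    return params * bytes_per_param / bandwidth_bytes_per_s


ACTIVE_PARAMS = 3e9   # "3B" active parameters, from the thread
SSD_BANDWIDTH = 3e9   # "3GB/s", from the thread (assumed sustained sequential read)

# Assumed weight precisions (hypothetical; the thread does not state one).
for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    t = load_seconds(ACTIVE_PARAMS, bytes_per_param, SSD_BANDWIDTH)
    print(f"{label}: {t:.2f} s per token if all 3B must be re-read")
```

Whether that is small next to the inference time depends on how many of the needed weights are already cached and on the target token rate, neither of which the thread specifies.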
