Hacker News

coolspot, yesterday at 7:47 PM (1 reply)

Yes, but you don’t know which 3B parameters you will need, so you either have to keep all 80B in VRAM or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it could be a different 3B for each token.
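To put rough numbers on the "all 80B in VRAM" point, here is a minimal sketch of the weight footprint. The 80B and 3B figures come from the comment; the bytes-per-parameter values (fp16, int8, 4-bit) are assumptions for illustration, and KV cache and activations are ignored.

```python
# Back-of-envelope weight footprint. Precision choices below are assumptions,
# not something stated in the thread.
TOTAL_PARAMS = 80e9   # all experts (from the comment)
ACTIVE_PARAMS = 3e9   # parameters used per token (from the comment)


def footprint_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    for label, bpp in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
        total = footprint_gb(TOTAL_PARAMS, bpp)
        active = footprint_gb(ACTIVE_PARAMS, bpp)
        print(f"{label}: total {total:.0f} GB, active per token {active:.1f} GB")
```

The gap between the two columns is the whole argument: the per-token working set is small, but the set it is drawn from is not.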


Replies

drozycki, yesterday at 8:10 PM

The latest SSDs benchmark at 3GB/s and up. The marginal latency would be trivial compared to the inference time.
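One way to sanity-check the latency is to compute the worst-case transfer time directly. This is a sketch under stated assumptions: every active parameter has to be streamed from the SSD for a given token (no cache hits), the SSD is the bottleneck in the NVMe -> RAM -> VRAM path, and the precision values are illustrative rather than taken from the thread.

```python
# Worst-case time to stream one token's active weights from SSD.
# Assumes zero cache hits and that SSD bandwidth is the limiting factor.
ACTIVE_PARAMS = 3e9        # parameters needed per token (from the thread)
SSD_BANDWIDTH_GB_S = 3.0   # sequential read speed quoted in the comment


def load_time_s(params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Seconds to read `params` weights at `bandwidth_gb_s` (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    for label, bpp in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
        t = load_time_s(ACTIVE_PARAMS, bpp, SSD_BANDWIDTH_GB_S)
        print(f"{label}: {t:.2f} s per token if nothing is cached")
```

Whether the result counts as trivial depends on how it compares with the per-token compute time on the target hardware, and on how often consecutive tokens reuse weights that are already resident.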