Hacker News

kaoD · yesterday at 7:15 PM

> The model weights stay resident in VRAM permanently so there's no loading/unloading per request.

Yes, I was thinking about context buffers, which I assume are not small in large models. That has to be loaded into VRAM, right?

If I keep sending large context buffers, will that hog the batches?
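For a rough sense of scale, here's a back-of-envelope sketch of the per-request KV-cache (context buffer) footprint. It assumes a hypothetical 70B-class model with grouped-query attention and an fp16 cache; all the numbers are illustrative, not any particular deployment:

    # Back-of-envelope KV-cache size for one request (illustrative numbers,
    # not any specific model or provider).
    num_layers   = 80      # hypothetical 70B-class model
    num_kv_heads = 8       # grouped-query attention
    head_dim     = 128
    bytes_per_el = 2       # fp16/bf16 cache entries
    context_len  = 32_000  # tokens kept resident for this request

    # 2x for keys and values, per layer, per head, per token.
    kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el * context_len

    print(f"KV cache for one {context_len}-token request: {kv_bytes / 1e9:.1f} GB")
    # ~10.5 GB under these assumptions

At that scale a handful of long-context requests can use up whatever VRAM is left after the weights, which is why I'm wondering about the effect on batching.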


Replies

jrandolf · yesterday at 8:14 PM

Not if you are the only one. We have rate limits to prevent this in case, idk, you share your key with 1000 people lol.
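(For what it's worth, the usual shape of such a guard is a per-key token budget. The sketch below is only an illustration under that assumption, with made-up thresholds; it is not the provider's actual implementation.)

    import time
    from collections import defaultdict

    # Illustrative per-key token-budget limiter (token bucket); thresholds
    # and structure are assumptions for the sake of the example.
    class TokenBudget:
        def __init__(self, tokens_per_minute=200_000):
            self.rate = tokens_per_minute / 60.0   # tokens refilled per second
            self.capacity = tokens_per_minute
            self.buckets = defaultdict(lambda: [self.capacity, time.monotonic()])

        def allow(self, api_key: str, prompt_tokens: int) -> bool:
            tokens, last = self.buckets[api_key]
            now = time.monotonic()
            tokens = min(self.capacity, tokens + (now - last) * self.rate)
            if tokens < prompt_tokens:
                self.buckets[api_key] = [tokens, now]
                return False                        # reject: would hog batch capacity
            self.buckets[api_key] = [tokens - prompt_tokens, now]
            return True

    limiter = TokenBudget()
    print(limiter.allow("key-123", 32_000))  # True until the per-key budget is spent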