> Storing on GPU would be the absolute dumbest thing they could do
No. It’s not dumb. There will be multiple cache tiers in use: the fastest and most expensive is on-GPU VRAM, with cache-aware routing that steers requests to the specific GPUs already holding their prefixes, and then progressive eviction to CPU RAM and perhaps SSD after that. That is how vLLM works, as you can see if you look it up, and you can find plenty of information on the multi-tier approach from inference providers, e.g. the new Inference Engineering book by Philip Kiely.
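As a rough illustration of what I mean (this is a sketch of the idea, not vLLM's actual code, and all names here are made up): keep KV-cache blocks in GPU VRAM while capacity allows, evict the least-recently-used blocks to CPU RAM, and spill from CPU RAM to SSD.

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative multi-tier KV cache: GPU VRAM -> CPU RAM -> SSD."""

    def __init__(self, gpu_capacity, cpu_capacity):
        self.gpu = OrderedDict()   # hottest tier: on-GPU VRAM
        self.cpu = OrderedDict()   # middle tier: CPU RAM
        self.ssd = {}              # coldest tier: SSD (unbounded in this sketch)
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    def get(self, prefix_hash):
        """Look up a cached prefix; on a hit, promote it back to the GPU tier."""
        for tier in (self.gpu, self.cpu, self.ssd):
            if prefix_hash in tier:
                block = tier.pop(prefix_hash)
                self.put(prefix_hash, block)  # promote to hottest tier
                return block
        return None  # miss: prefill has to recompute the prefix

    def put(self, prefix_hash, block):
        """Insert into GPU VRAM, cascading LRU evictions down the tiers."""
        self.gpu[prefix_hash] = block
        self.gpu.move_to_end(prefix_hash)
        if len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)      # LRU out of VRAM
            self.cpu[victim] = data
            if len(self.cpu) > self.cpu_capacity:
                victim, data = self.cpu.popitem(last=False)  # LRU out of CPU RAM
                self.ssd[victim] = data                      # spill to SSD
```

The cache-aware routing piece sits in front of this: the router hashes the prompt prefix and sends the request to whichever GPU is most likely to already hold those blocks in its hot tier.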
You are likely correct that the 1hr-cached data mostly doesn’t live on GPU (although it will depend on capacity: they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.