GPU VRAM has an opportunity cost, so caching is never free. If that memory is used to hold KV caches in the hope that they'll be useful later, and you lose that bet because the cache is never hit, you've spent money that could have gone to serving other requests.
That cost is proportional to how long the cache is held. Currently the cache is not application-controlled; it's managed implicitly, like a CPU cache.
If you hit the cache 1 ns after it was written, you get charged the same as if it had been held for 5 minutes or an hour.
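To make the "holding cost" concrete, here's a back-of-envelope sketch. Every number in it (layer count, head sizes, GPU price, VRAM size) is an illustrative assumption, not a measured or quoted value:

```python
# Back-of-envelope sketch of the opportunity cost of holding a KV cache in VRAM.
# All numbers below are illustrative assumptions, not measured values.
bytes_per_token = 2 * 32 * 8 * 128 * 2  # (K and V) * layers * kv_heads * head_dim * fp16 bytes
context_tokens = 100_000
cache_gb = bytes_per_token * context_tokens / 1e9

gpu_hourly_cost = 2.0   # assumed $/hour rental for a GPU with 80 GB of VRAM
gpu_vram_gb = 80
hold_hours = 1.0

# Fraction of the GPU's VRAM (and hence, roughly, of its rental cost) tied up
# by this one cache for as long as it's held, whether or not it's ever hit.
cost_of_holding = gpu_hourly_cost * (cache_gb / gpu_vram_gb) * hold_hours
print(f"~{cache_gb:.1f} GB cached, ~${cost_of_holding:.2f} to hold for {hold_hours} h")
```

The point isn't the exact dollar figure; it's that the cost scales with hold time, while the pricing you're exposed to as an API user doesn't.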
Also, with LLM APIs, I'm almost certain the cached state is offloaded to host RAM and then reloaded into GPU memory on a hit. If you are renting a GPU yourself, you can keep that state resident in GPU memory. And if you only need to hold it for very short periods, as in my example of generating one output token at a time and running some programmatic logic in between, using an API is currently prohibitively expensive and you have to self-host (see the sketch below).
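Here's a minimal sketch of what that looks like when self-hosting with Hugging Face transformers: the KV cache (`past_key_values`) stays on the GPU between calls, so you can run arbitrary logic between tokens without re-paying the prefill. Model name and device are placeholder choices; any causal LM works the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda").eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to("cuda")

past_key_values = None  # the KV cache; it stays resident in GPU memory between calls
with torch.no_grad():
    for _ in range(20):
        out = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values              # reuse cached K/V next step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

        # --- arbitrary programmatic logic between tokens goes here ---
        if next_token.item() == tokenizer.eos_token_id:
            break

        input_ids = next_token  # only feed the new token; the cache covers the rest
```

With an API you'd pay for (or lose) the cache between every one of those single-token calls; here it just sits in VRAM you're already paying for.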