stavros • yesterday at 11:57 PM • 1 reply

Probably because the costly operation is loading it onto the GPU; it doesn't matter whether it comes from disk or from your request.

Replies

zozbot234 • today at 12:12 AM

The point of prompt caching is to save on prefill, which for large contexts (common in agentic workloads) is quite expensive per token. So there is a context length beyond which storing the KV cache is worth it, because loading it back in is cheaper than recomputing it. For larger SOTA models, the per-token KV cache size is also much smaller relative to the compute cost of prefill, so caching becomes worthwhile even at shorter context lengths.
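
To make the break-even argument concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration, not a measurement: the two model shapes (loosely 8B-class and 70B-class dense models with grouped-query attention), the effective FLOP/s, the storage bandwidth, the fixed fetch latency, and the cost model itself (about 2 FLOPs per parameter per token plus a causal attention term, with a reload time that is linear in cache size).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model shape; all fields are illustrative assumptions."""
    n_params: float             # total weight count
    n_layers: int
    d_model: int                # n_heads * head_dim
    n_kv_heads: int             # fewer than n_heads under grouped-query attention
    head_dim: int
    kv_bytes_per_elem: int = 2  # fp16/bf16 cache entries


def kv_bytes_per_token(m: ModelShape) -> int:
    # One K and one V vector per layer per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.kv_bytes_per_elem


def prefill_seconds(m: ModelShape, n_tokens: int, flops_per_s: float) -> float:
    # ~2 FLOPs per parameter per token, plus causal attention (~2 * L * d * n^2).
    weight_flops = 2.0 * m.n_params * n_tokens
    attn_flops = 2.0 * m.n_layers * m.d_model * n_tokens ** 2
    return (weight_flops + attn_flops) / flops_per_s


def reload_seconds(m: ModelShape, n_tokens: int, bytes_per_s: float,
                   fixed_latency_s: float) -> float:
    # Fixed lookup/fetch latency plus time to stream the cached K/V tensors.
    return fixed_latency_s + n_tokens * kv_bytes_per_token(m) / bytes_per_s


def break_even_tokens(m: ModelShape, flops_per_s: float, bytes_per_s: float,
                      fixed_latency_s: float,
                      max_tokens: int = 200_000) -> Optional[int]:
    # Smallest context length at which reloading is no slower than recomputing.
    for n in range(1, max_tokens + 1):
        if reload_seconds(m, n, bytes_per_s, fixed_latency_s) <= prefill_seconds(m, n, flops_per_s):
            return n
    return None


def prefill_flops_per_kv_byte(m: ModelShape) -> float:
    # Compute saved per byte of cache stored (weight term only).
    return 2.0 * m.n_params / kv_bytes_per_token(m)


# Assumed shapes and hardware numbers, chosen only to illustrate the trend.
SMALL = ModelShape(n_params=8e9, n_layers=32, d_model=4096, n_kv_heads=8, head_dim=128)
LARGE = ModelShape(n_params=70e9, n_layers=80, d_model=8192, n_kv_heads=8, head_dim=128)
FLOPS_PER_S = 4e14       # assumed effective prefill throughput
BYTES_PER_S = 5e9        # assumed cache storage read bandwidth
FIXED_LATENCY_S = 0.05   # assumed per-lookup overhead

if __name__ == "__main__":
    for name, shape in (("small", SMALL), ("large", LARGE)):
        n = break_even_tokens(shape, FLOPS_PER_S, BYTES_PER_S, FIXED_LATENCY_S)
        print(name, kv_bytes_per_token(shape), "bytes/token, break-even at", n, "tokens")
```

Under these assumed numbers the larger model saves more prefill compute per cached byte, so its break-even context length comes out shorter than the smaller model's. Real systems will differ depending on batching, where the cache is stored, and how attention is implemented.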
