Hacker News

stavros, yesterday at 11:57 PM

Probably because the costly operation is loading it onto the GPU; it doesn't matter whether it comes from disk or from your request.


Replies

zozbot234, today at 12:12 AM

The point of prompt caching is to save on prefill, which for large contexts (common in agentic workloads) is quite expensive per token. So there is a context length beyond which storing the KV cache is worth it, because loading it back in is cheaper than recomputing it. For larger SOTA models, the KV cache size per token is also much smaller relative to the compute cost of prefill, so caching becomes worthwhile even for smaller contexts.
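
A rough back-of-the-envelope sketch of that break-even, in Python. Everything here is an assumption for illustration: the `ModelSpec` fields, the FLOP approximations (about 2 FLOPs per parameter per token for the dense layers, plus a causal-attention term that grows quadratically with context), and the idea that reloading a cached prefix costs a fixed latency plus bytes over bandwidth. Real serving stacks will differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelSpec:
    # Hypothetical model description; none of these values come from a real provider.
    params: float          # total parameter count
    n_layers: int
    d_model: int
    n_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # assumes fp16/bf16 KV entries


def kv_bytes_per_token(m: ModelSpec) -> int:
    # One K and one V vector per layer per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_elem


def prefill_seconds(m: ModelSpec, n_tokens: int, flops_per_sec: float) -> float:
    # Assumed approximation: dense work is linear in tokens,
    # causal attention work is quadratic in tokens.
    dense = 2.0 * m.params * n_tokens
    attn = 2.0 * m.n_layers * m.d_model * n_tokens ** 2
    return (dense + attn) / flops_per_sec


def load_seconds(m: ModelSpec, n_tokens: int, bytes_per_sec: float,
                 fixed_latency: float) -> float:
    # Assumed cost of fetching a stored KV cache: fixed overhead plus transfer time.
    return fixed_latency + kv_bytes_per_token(m) * n_tokens / bytes_per_sec


def break_even_tokens(m: ModelSpec, flops_per_sec: float, bytes_per_sec: float,
                      fixed_latency: float,
                      max_tokens: int = 10_000_000) -> Optional[int]:
    """Smallest context length at which recomputing costs more than reloading.

    Returns None if reloading never wins within max_tokens.
    """
    def recompute_is_slower(n: int) -> bool:
        return (prefill_seconds(m, n, flops_per_sec)
                > load_seconds(m, n, bytes_per_sec, fixed_latency))

    if not recompute_is_slower(max_tokens):
        return None
    # Under this cost model the predicate flips from False to True exactly once,
    # so a binary search finds the crossover.
    lo, hi = 1, max_tokens
    while lo < hi:
        mid = (lo + hi) // 2
        if recompute_is_slower(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    # Made-up numbers, purely to exercise the functions.
    spec = ModelSpec(params=8e9, n_layers=32, d_model=4096, n_kv_heads=8, head_dim=128)
    print(kv_bytes_per_token(spec))
    print(break_even_tokens(spec, flops_per_sec=1e14, bytes_per_sec=5e9,
                            fixed_latency=0.05))
```

Under this toy model, anything below the returned length is cheaper to recompute, and anything above it is cheaper to reload. Raising `params` while keeping the KV shape fixed pushes the break-even lower, which is the "larger models benefit sooner" effect described above.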