Hacker News

barbegal · today at 1:13 PM

Does the KV cache really grow to use more memory than the model weights? The reduction in overall RAM relies on the KV cache being a substantial proportion of total memory usage, but with very large models I can't see how that holds true.


Replies

zozbot234 · today at 2:57 PM

For long context, yes, this is at least plausible. And the latest models are reaching context lengths of 1M tokens or more.
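A rough back-of-envelope sketch of why this is plausible, assuming a hypothetical 70B-parameter model with a Llama-3-70B-like GQA configuration (80 layers, 8 KV heads, head dim 128, fp16); the numbers are illustrative, not taken from the thread:

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Per token, each layer stores one K and one V vector per KV head,
    # hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

# ~70B parameters at 2 bytes each (fp16)
weights_gib = 70e9 * 2 / 2**30
kv_1m_gib = kv_cache_bytes(1_000_000) / 2**30

print(f"weights ~ {weights_gib:.0f} GiB")   # ~130 GiB
print(f"KV cache @ 1M tokens ~ {kv_1m_gib:.0f} GiB")  # ~305 GiB
```

Under these assumptions the KV cache costs about 320 KB per token, so around a million tokens of context it overtakes the fp16 weights, which is why long-context serving focuses so heavily on cache compression and paging.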