Hacker News

davmre, yesterday at 2:07 AM

The KV cache stores the key and value vectors for every attention head at every layer of the model, for every token in context, so it gets quite large. ChatGPT also estimates 60-100 GB for the full token context of an Opus-sized model:

https://chatgpt.com/share/69dc5030-268c-83e8-92c2-6cef962dc5...
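The arithmetic behind that estimate is easy to reproduce. Here is a minimal back-of-envelope sketch; the configuration numbers (layer count, KV heads, head dimension) are pure assumptions for illustration, since Anthropic does not publish Opus's architecture:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    # 2x: one key vector and one value vector per KV head, per layer, per token.
    # bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Hypothetical large-model config with grouped-query attention
# (all numbers are guesses, not published specs):
gb = kv_cache_bytes(n_layers=100, n_kv_heads=8, head_dim=128,
                    n_tokens=200_000) / 1e9
print(f"{gb:.0f} GB")  # → 82 GB
```

With those made-up numbers a 200K-token context lands right in the 60-100 GB range; a model without grouped-query attention (more KV heads) would be several times larger.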


Replies

CraigRood, yesterday at 1:24 PM

That is actually nuts... I'm trying to understand the true costs of AI; I wonder how to plug this in!

visarga, yesterday at 12:20 PM

There are ways to quantize or compress the KV cache.
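For example, quantizing cached key/value vectors from fp16 to int8 halves the footprint. A toy pure-Python sketch of per-vector int8 quantization (real systems operate on whole cache tensors on-GPU with finer-grained scales; the vector values here are made up):

```python
def quantize_int8(vec):
    # One scale per vector: map the largest magnitude to +/-127.
    scale = max(abs(v) for v in vec) / 127.0
    return [round(v / scale) for v in vec], scale

def dequantize(q, scale):
    return [x * scale for x in q]

key = [0.8, -1.2, 0.05, 2.54]  # toy key vector (fp16/fp32 in practice)
q, s = quantize_int8(key)
restored = dequantize(q, s)

# Each element now takes 1 byte instead of 2 (fp16) or 4 (fp32),
# at the cost of a bounded rounding error of at most scale / 2.
err = max(abs(a - b) for a, b in zip(restored, key))
```

The accuracy/size trade-off is why such schemes work well for the KV cache: attention is fairly tolerant of small per-element rounding error, while memory is usually the binding constraint at long contexts.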