The KV cache holds the key and value vectors for every attention head at every layer, for every token in the context, so it gets quite large. ChatGPT also estimates 60-100 GB for the full context window of an Opus-sized model:
https://chatgpt.com/share/69dc5030-268c-83e8-92c2-6cef962dc5...
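You can sanity-check that ballpark with a back-of-envelope sketch. Anthropic hasn't published Opus's architecture, so every number below is a made-up but plausible config for a model of that scale (96 layers, grouped-query attention with 8 KV heads, head dim 128, fp16 cache, 200k-token context):

    # Per token, the cache stores one key and one value vector
    # per KV head per layer, hence the factor of 2.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
        return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

    # Hypothetical Opus-scale config, fp16 (2 bytes/element), 200k context:
    size = kv_cache_bytes(96, 8, 128, 200_000, 2)
    print(f"{size / 1e9:.1f} GB")  # ~78.6 GB, squarely in the 60-100 GB range

Note that without grouped-query attention (i.e., a KV head per query head) the same math would land an order of magnitude higher, which is exactly why modern large models use GQA or similar.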
There are also ways to quantize or otherwise compress the KV cache to shrink it.
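For a rough sense of what quantization buys you, here's the same hypothetical config from the sketch above at different cache precisions (again, illustrative numbers, not Opus's actual setup):

    # Reusing kv_cache_bytes() from above; only the bytes/element changes.
    for name, b in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
        gb = kv_cache_bytes(96, 8, 128, 200_000, b) / 1e9
        print(f"{name}: {gb:.1f} GB")  # fp16 ~78.6, 8-bit ~39.3, 4-bit ~19.7

Halving the precision halves the cache, which is why KV-cache quantization is such a popular lever for long-context serving.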
That is actually nuts... I'm trying to understand the true costs of AI, and I'm wondering how to plug this number in!