Hacker News

zozbot234 · last Thursday at 10:24 AM

Additional compute is generally a win for prefill, while memory bandwidth is king for decode. The KV cache, however, is the main blocker for long context, so it should be offloaded to system RAM, and even to NVMe swap, as context grows. Yes, that's slow in absolute terms, but it's faster (and more power efficient, which makes everything else faster) than not having the cache at all, so it's still a huge win.
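To make the offloading idea concrete, here is a minimal sketch of a tiered KV cache: recent entries stay in a small fast tier (standing in for GPU VRAM) and older entries spill to a slow tier (standing in for system RAM or NVMe). The class name, capacities, and plain dicts are all illustrative assumptions, not any real inference engine's API; a real implementation would move actual key/value tensors over PCIe.

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier KV cache: fast tier evicts oldest entries
    to a slow tier instead of dropping them."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # token position -> (key, value), hot tier
        self.slow = {}             # offloaded entries (RAM / NVMe in practice)

    def put(self, pos, kv):
        self.fast[pos] = kv
        self.fast.move_to_end(pos)
        # Evict oldest positions to the slow tier once the fast tier is full.
        while len(self.fast) > self.fast_capacity:
            old_pos, old_kv = self.fast.popitem(last=False)
            self.slow[old_pos] = old_kv

    def get(self, pos):
        if pos in self.fast:
            return self.fast[pos]
        # Slow-path fetch: in a real system this is a PCIe or NVMe read,
        # which is still far cheaper than recomputing the attention states.
        return self.slow[pos]

cache = TieredKVCache(fast_capacity=4)
for i in range(10):
    cache.put(i, (f"k{i}", f"v{i}"))

print(len(cache.fast))  # 4 most recent positions stay hot
print(len(cache.slow))  # 6 older positions were offloaded
print(cache.get(0))     # ('k0', 'v0') — served from the slow tier
```

The point of the sketch is the trade-off in `get`: a slow-tier hit costs a transfer, but losing the entry entirely would mean recomputing it from scratch.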


Replies

moffkalast · last Thursday at 6:37 PM

Well, if you do that you reverse the strengths of your system. It might be best to work with the context length you can offload, like a normal person.