Hacker News

zozbot234 · last Thursday at 10:24 AM

Additional compute is generally a win for prefill, while memory bandwidth is king for decode. The KV cache, however, is the main blocker for long context, so it should be offloaded to system RAM, and even to NVMe swap, as context grows. Yes, that's slow in absolute terms, but it's faster (and more power efficient, which makes everything else faster) than not having the cache at all, so it's still a huge win.
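To make the offloading idea concrete, here is a minimal sketch of a tiered KV cache: recent entries stay in a small fast tier (standing in for GPU VRAM) and older entries spill to a slow tier (standing in for system RAM or NVMe). The class name, capacities, and plain dicts are all illustrative assumptions, not any real inference engine's API; a real implementation would move actual key/value tensors over PCIe.

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier KV cache: fast tier evicts oldest entries
    to a slow tier instead of dropping them."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # token position -> (key, value), hot tier
        self.slow = {}             # offloaded entries (RAM / NVMe in practice)

    def put(self, pos, kv):
        self.fast[pos] = kv
        self.fast.move_to_end(pos)
        # Evict oldest positions to the slow tier once the fast tier is full.
        while len(self.fast) > self.fast_capacity:
            old_pos, old_kv = self.fast.popitem(last=False)
            self.slow[old_pos] = old_kv

    def get(self, pos):
        if pos in self.fast:
            return self.fast[pos]
        # Slow-path fetch: in a real system this is a PCIe or NVMe read,
        # which is still far cheaper than recomputing the attention states.
        return self.slow[pos]

cache = TieredKVCache(fast_capacity=4)
for i in range(10):
    cache.put(i, (f"k{i}", f"v{i}"))

print(len(cache.fast))  # 4 most recent positions stay hot
print(len(cache.slow))  # 6 older positions were offloaded
print(cache.get(0))     # ('k0', 'v0') — served from the slow tier
```

The point of the sketch is the trade-off in `get`: a slow-tier hit costs a transfer, but losing the entry entirely would mean recomputing it from scratch.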


Replies

moffkalast · last Thursday at 6:37 PM

Well, if you do that you reverse the strengths of your system. It might be best to work with the context length you can offload, like a normal person.