The problem with this approach is that even recomputing a "draft" of the KV cache is still...

zozbot234 • today at 9:03 AM

The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant savings by always recomputing the earliest tokens, but it's not a good tradeoff as context sizes grow.
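To spell out the arithmetic behind "still quadratic": this is a back-of-the-envelope sketch that assumes plain causal self-attention, where token $i$ attends to tokens $1..i$. The symbols $n$ (context length), $k$ (number of earliest tokens recomputed), and $\alpha$ (the fraction $k/n$) are introduced here for illustration only.

```latex
\[
\text{pairs}(n) \;=\; \sum_{i=1}^{n} i \;=\; \frac{n(n+1)}{2} \;=\; O(n^2),
\qquad
\text{pairs}(k) \;=\; \frac{k(k+1)}{2} \;\approx\; \frac{\alpha^2 n^2}{2}
\quad \text{for } k = \alpha n .
\]
```

Under these assumptions, recomputing any fixed fraction of the context changes only the constant factor $\alpha^2$, not the $n^2$ growth.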

Replies

zozbot234 • today at 3:14 PM

BTW, I forgot to mention that you can make this work, in a way, but only if your model architecture generalizes the context and attention mechanism so that the context is no longer a pure sequence. You could then have a large number of distinct "early" token sequences, each self-contained and not depending on any other tokens; your source code files might be one example. Later parts of the context would of course depend on all of those files as usual. This makes prefill for the earlier context both reusable and cheaply recomputable throughout, at the cost of losing some dependencies that would previously have been accounted for: your model becomes faster and more efficient, but perhaps not quite as smart.
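One way to picture the layout described above is as an attention mask. The following is a minimal sketch in plain Python, not any particular model's implementation: the function names `segmented_causal_mask` and `attended_pairs`, and the split into "segments" plus a "tail", are hypothetical and chosen only for illustration. Each early segment attends only within itself, while the tail attends to everything before it.

```python
def segmented_causal_mask(segment_lengths, tail_length):
    """Return mask[q][k] == True when query q may attend to key k.

    Hypothetical layout: independent early segments (e.g. one per source
    file) followed by a tail that sees every earlier token.
    """
    # Map each early position to the index of the segment that owns it.
    owner = []
    for seg_id, length in enumerate(segment_lengths):
        owner.extend([seg_id] * length)
    early = len(owner)
    total = early + tail_length

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: never look ahead
            if q >= early:
                mask[q][k] = True  # tail attends to all earlier tokens
            else:
                mask[q][k] = owner[q] == owner[k]  # stay inside own segment
    return mask


def attended_pairs(mask):
    """Count (query, key) pairs, a rough proxy for attention cost."""
    return sum(sum(row) for row in mask)


# Three independent 4-token "files" plus a 2-token tail, versus the same
# 14 tokens treated as one ordinary causal sequence.
segmented = segmented_causal_mask([4, 4, 4], tail_length=2)
plain = segmented_causal_mask([], tail_length=14)
print(attended_pairs(segmented), attended_pairs(plain))
```

In this toy case the segmented layout touches 57 query-key pairs against 105 for the plain sequence. Because no early segment looks outside itself, its keys and values depend only on its own tokens, which is what would make them reusable; the cross-segment entries that are now `False` are the lost dependencies the comment mentions.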

saagarjha • today at 10:54 AM

Sure, but any classical attention mechanism is quadratic in context length.
