Hacker News

hypfer, today at 8:18 AM

TL;DR (and please correct me if I got it wrong):

A tiny deterministic model predicts the K/V cache, the prediction is compared with the real values, and only the delta is stored in VRAM. Reading works the other way round: run the same predictor again, apply the stored delta, and you get the full correct value back while only ever storing the delta.

And this works because you're never looking at the whole K/V cache at once, always just a slice. So you only need a memory buffer the size of that slice for the reconstructed values.
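To make sure I'm describing the same thing, here is a minimal sketch of that predict / store delta / re-predict loop. Everything in it is an assumption for illustration: `toy_predictor`, `encode`, `decode_slice`, and the integer "cache" values are made up, and the real predictor is presumably a small learned model rather than a formula. The only point is that a deterministic predictor plus the stored deltas gives the exact values back, one slice at a time.

```python
from typing import Callable, List


def toy_predictor(position: int) -> int:
    """Hypothetical stand-in for the tiny deterministic model.

    It guesses the cache value from the position alone; the same input
    always yields the same guess, which is what makes decoding exact.
    """
    return (position * 7) % 50


def encode(values: List[int], predict: Callable[[int], int]) -> List[int]:
    """Keep only the difference between each real value and its prediction."""
    return [value - predict(i) for i, value in enumerate(values)]


def decode_slice(
    deltas: List[int], predict: Callable[[int], int], start: int, stop: int
) -> List[int]:
    """Rebuild values[start:stop] by re-running the predictor and adding deltas.

    Only stop - start reconstructed values exist at any one time.
    """
    return [predict(i) + deltas[i] for i in range(start, stop)]


# Made-up integer "cache" entries, chosen to sit close to the predictions.
kv_values = [3, 9, 15, 20, 30, 33, 41, 47]

deltas = encode(kv_values, toy_predictor)                 # what would be stored
restored = decode_slice(deltas, toy_predictor, 2, 6)      # one slice on demand

assert restored == kv_values[2:6]
```

With these toy numbers the deltas come out as small integers, which is where any saving would have to come from: small residuals need fewer bits than the raw values, but only if the predictor is actually good.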

___

If this works out and I've understood it correctly, then _I think_ a 24GB RTX 4090 could fit 256k of q8 context next to Qwen3.6-27B at IQ4_NL.

Or, alternatively, something like 208k of context (matching the Claude API limit of 200k on some plans) with a slightly larger quant like UD-Q4_K_XL.
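For a sense of scale, this is the usual back-of-the-envelope formula for an uncompressed K/V cache in a plain full-attention transformer. The layer count, KV head count, and head dimension below are placeholders I picked for round numbers, not the real config of Qwen3.6-27B or any other model, and `kv_cache_bytes` is just a helper name for this sketch. I'm also treating q8 as roughly one byte per element and ignoring per-block scale overhead.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_tokens: int,
    bytes_per_element: float,
) -> float:
    """Size of an uncompressed K/V cache for a plain full-attention model.

    The factor 2 covers keys and values, stored once per layer, per KV head,
    per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_element


# Placeholder architecture numbers (assumptions, NOT a real model config).
size_bytes = kv_cache_bytes(
    n_layers=32,
    n_kv_heads=8,
    head_dim=128,
    n_tokens=256 * 1024,      # "256k" context
    bytes_per_element=1.0,    # q8 approximated as one byte per element
)

size_gib = size_bytes / 2**30
print(f"{size_gib:.1f} GiB")  # prints "16.0 GiB" for these placeholder numbers
```

With placeholder numbers in that range, the uncompressed cache alone would take most of a 24GB card before the weights are even loaded, which is why storing only deltas would matter so much here.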

That would be massive. Especially since the thing has so much compute to spare.

Though that all depends on the size of the predictor model, I guess?