suprjami, today at 7:47 PM

Some models really suffer badly from KV cache quantisation. You can also take a speed hit if you use different quantisation types for K and V.

TurboQuant seems to be the next big thing in context memory usage. It uses polar coordinates to achieve a ~5x reduction in memory usage with minimal or no quality loss, and even a slight speedup in some cases.
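
For a rough sense of scale, here is a back-of-the-envelope sketch of what a ~5x reduction would mean. The model shape is hypothetical and picked only for illustration, the baseline is assumed to be an fp16 cache, and the formula ignores any per-block scale or metadata overhead that a real quantisation format adds. It is not TurboQuant's actual method, just the memory arithmetic.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Approximate KV cache size: one K and one V vector per layer, per KV head, per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = K plus V
    return values * bits_per_value / 8


# Hypothetical model shape (assumption, not taken from any specific model).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32768)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
# A 5x reduction from 16 bits works out to about 3.2 bits per cached value.
reduced = kv_cache_bytes(**shape, bits_per_value=16 / 5)

print(f"fp16 cache: {fp16 / 2**30:.2f} GiB")
print(f"5x smaller: {reduced / 2**30:.2f} GiB")
```

So at a 32k context this made-up shape goes from roughly 4 GiB of cache to under 1 GiB, which is why the quality-loss question matters so much.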