zozbot234, today at 10:51 AM

That's pretty nice, actually. How much KV cache does that model require at full context? That tends to be the main limit on running concurrent requests locally. There's KV quantization, but it has an outsized negative impact on model quality.
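
For a rough sense of the numbers, here is a minimal sketch of the usual KV cache estimate for a standard transformer with grouped-query attention: two tensors (K and V) per layer, each of size kv_heads × head_dim per token. The model in question isn't named here, so the config below (32 layers, 8 KV heads, head_dim 128, 32k context, fp16) is purely hypothetical, and `kv_cache_bytes` is just an illustrative helper, not any library's API. Architectures with sliding-window or latent attention would need a different formula.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, num_requests=1):
    """Estimate KV cache size in bytes for plain (grouped-query) attention.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16; use 1 for an 8-bit cache.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * num_requests


GIB = 1024 ** 3

# Hypothetical config, NOT the model discussed in the thread.
one_request = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                             context_len=32768)
four_requests = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                               context_len=32768, num_requests=4)

print(f"1 request at full context:  {one_request / GIB:.1f} GiB")
print(f"4 requests at full context: {four_requests / GIB:.1f} GiB")
```

The cache grows linearly with both context length and the number of concurrent requests, which is why it ends up being the limit locally: weights are a fixed cost, but every extra full-context request pays the whole per-request amount again.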