zozbot234 • yesterday at 6:04 PM

KV cache size is the main constraint on batching (for any given context length), and that's a huge deal for efficiency both locally and in the data center. DeepSeek V4's reduced KV requirement is a real game changer: it unlocks batching requests together for local inference, not just at scale.
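To make the arithmetic behind that claim concrete, here is a rough sketch of how per-sequence KV cache size limits batch size. It assumes a plain multi-head or grouped-query attention layout with an unquantized cache; the model dimensions and the 8 GiB budget are hypothetical placeholders, not DeepSeek V4's actual figures, and architectures that compress the cache differently would need a different formula.

```python
def kv_bytes_per_sequence(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Approximate KV cache bytes for one sequence at full context length.

    Assumes standard attention: one key and one value vector per layer,
    per KV head, per token. bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def max_batch_size(kv_budget_bytes, per_seq_bytes):
    """Number of full-length sequences whose KV cache fits in the budget."""
    return kv_budget_bytes // per_seq_bytes


GIB = 1024 ** 3

# Hypothetical model dimensions, chosen only for illustration.
baseline = kv_bytes_per_sequence(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=8192)
# Same hypothetical model with a 4x smaller KV footprint (fewer KV heads).
reduced = kv_bytes_per_sequence(n_layers=32, n_kv_heads=2, head_dim=128, ctx_len=8192)

budget = 8 * GIB  # memory left for the KV cache after weights (assumed)

print(baseline / GIB, max_batch_size(budget, baseline))  # 1.0 GiB per sequence, batch of 8
print(reduced / GIB, max_batch_size(budget, reduced))    # 0.25 GiB per sequence, batch of 32
```

Under these assumptions, cutting the per-sequence KV footprint by 4x lets four times as many requests share the same memory at the same context length, which is the batching headroom the comment is pointing at.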

Replies

anonym29 • yesterday at 6:20 PM

This may be relevant for parallelizable workloads. For reference on my perspective: I come at this as someone who is exclusively concerned with sequential, non-parallelizable, single-user, single-system workloads.
