Hacker News

cold_harbor today at 5:35 PM, 3 replies

their MLA (Multi-head Latent Attention) architecture cuts the KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run; it's not just a price war to gain market share.
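rough back-of-envelope sketch of where a saving like that comes from, assuming standard attention caches full keys and values per KV head while MLA caches one compressed latent plus a small shared positional key per layer. all the dimensions below (32 layers, 16 KV heads, head dim 128, latent dim 512, RoPE dim 64, fp16) are made-up illustrative numbers, not any specific model's config, and the ratio swings a lot depending on which baseline you compare against.

```python
def kv_bytes_per_token_standard(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV cache for standard attention: keys + values for every KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_bytes_per_token_mla(n_layers, latent_dim, rope_dim, bytes_per_elem=2):
    """Per-token cache for an MLA-style layout (assumption: one compressed
    latent vector plus one shared positional key per layer)."""
    return n_layers * (latent_dim + rope_dim) * bytes_per_elem


def cache_gib(bytes_per_token, context_len):
    """Total cache size in GiB for a single sequence of context_len tokens."""
    return bytes_per_token * context_len / 2**30


# Hypothetical configuration, chosen only for illustration.
standard = kv_bytes_per_token_standard(n_layers=32, n_kv_heads=16, head_dim=128)
mla = kv_bytes_per_token_mla(n_layers=32, latent_dim=512, rope_dim=64)

print(f"standard: {standard} bytes/token")
print(f"MLA-style: {mla} bytes/token")
print(f"reduction: {standard / mla:.1f}x")
print(f"128k context: {cache_gib(standard, 131072):.1f} GiB vs {cache_gib(mla, 131072):.1f} GiB")
```

with these made-up numbers the per-token cache shrinks by about 7x. a smaller cache per sequence means more sequences fit in the same memory, which is where the serving cost drops.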


Replies

zozbot234 today at 5:58 PM

That's also a game changer for local inference. It unlocks long contexts, batched inference, and storing the KV cache to disk on ordinary consumer platforms.

vitorsr today at 7:51 PM

Yes. The discount was most likely a "post-market trial" of how efficiently the caching works for the new-generation models.

hmaddipatla today at 6:30 PM

[dead]