their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is a...

cold_harbor • today at 5:35 PM • 3 replies • view on HN

their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run, not just a price war to gain market share.

Replies

zozbot234 • today at 5:58 PM

That's also a game changer for local inference. It unlocks long contexts, batched inference and storing the KV cache to disk on ordinary consumer platforms.

vitorsr • today at 7:51 PM

Yes. The discount was most likely a "post-market trial" of how efficient the caching works for the new generation models.

➕ show 1 reply

hmaddipatla • today at 6:30 PM

[dead]

alt Hacker News

Replies