You can cache the K and V matrices, but with matrices that large you'll still pay a lot of compute for attention at generation time, even if the user only adds a five-word question.
The state of the system can be cached after the system prompt is processed, and every new chat can start from that state. O(n^2) is not great, but apparently it's fine at these context lengths, and I'm sure it's a factor in their minimum prompt cost. Advances like grouped-query, multi-query, or sparse attention will hopefully chip away at that quadratic eventually.
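A toy sketch of the caching idea discussed above, assuming single-head attention with made-up dimensions: the system prompt's K/V rows are computed once and cached, and each new token then attends over the cache (O(n) per token) and appends one row, rather than recomputing the full n x n attention from scratch.

```python
import numpy as np

d = 8  # toy head dimension (assumed for illustration)

def attend(q, K, V):
    """One query token attending over all cached keys/values: O(n) per new token."""
    scores = K @ q / np.sqrt(d)           # (n,) similarity of q to each cached key
    w = np.exp(scores - scores.max())     # numerically stable softmax
    w /= w.sum()
    return w @ V                          # (d,) weighted sum of cached values

rng = np.random.default_rng(0)

# "System prompt": compute its K/V rows once, then cache them.
system_len = 1000
K_cache = rng.normal(size=(system_len, d))
V_cache = rng.normal(size=(system_len, d))

# New chat resuming from the cached state: a short user question only adds
# a few rows and attends over the cache, instead of redoing all 1000+ rows.
for _ in range(5):  # a five-word user question
    q = rng.normal(size=d)
    out = attend(q, K_cache, V_cache)
    K_cache = np.vstack([K_cache, rng.normal(size=(1, d))])
    V_cache = np.vstack([V_cache, rng.normal(size=(1, d))])

print(K_cache.shape)  # cache grew by only the 5 new rows
```

The O(n^2) cost shows up when you *build* the cache for the prompt itself; once cached, each generated token costs O(n), which is why providers can amortize a shared system prompt across chats.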