Hacker News

ElectricalUnion · today at 11:38 AM · 1 reply

You need the rest of the RAM for the context. If you don't want to end up with a toy context, or a quantized lossy one, it's pretty easy to end up spending 50+ GB just on the KV cache, per simultaneous inference slot.
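A rough sketch of where that number comes from: the KV cache stores one key and one value vector per layer, per KV head, per token, so it grows linearly with context length. The config below is illustrative (a Llama-3-70B-style model with 80 layers, 8 grouped-query KV heads, and head dimension 128 is an assumption, not something stated above):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_param: int = 2) -> int:
    # Factor of 2: both K and V are cached for every layer/head/token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param

# Hypothetical 70B-class config at fp16 with a full 128k-token context:
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      seq_len=131_072)
print(f"{size / 2**30:.0f} GiB")  # → 40 GiB, per inference slot
```

With more layers, more KV heads (no GQA), or a longer context, the same arithmetic easily crosses 50 GB, which is the comment's point.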


Replies

zozbot234 · today at 12:13 PM

[dead]