But that's only for prefilling right? Or is it beneficial for decoding too (I guess you can do ...

liuliu • last Friday at 10:51 PM • 2 replies • view on HN

But that's only for prefilling right? Or is it beneficial for decoding too (I guess you can do KV lookup on shards, not sure how much speed-up that will be though).

Replies

zackangelo • last Friday at 11:00 PM

No you use tensor parallelism in both cases.

The way it typically works in an attention block is: smaller portions of the Q, K and V linear layers are assigned to each node and are processed independently. Attention, rope norm etc is run on the node-specific output of that. Then, when the output linear layer is applied an "all reduce" is computed which combines the output of all the nodes.

EDIT: just realized it wasn't clear -- this means that each node ends up holding a portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA where there are fewer KV heads than ranks you end up having to do some replication etc)

➕ show 1 reply

monster_truck • last Friday at 11:25 PM

Even if it wasn't outright beneficial for decoding by itself, it would still allow you to connect a second machine running a smaller, more heavily quantized version of the model for speculative decoding which can net you >4x without quality loss

alt Hacker News

Replies