That doesn't work. Think about it a bit more.
Hint: what's in the kv cache when you start processing the 2nd token?
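To spell the hint out: the kv cache entry for token t is only produced by the step that generates token t, so step t+1 can't start until step t finishes. A toy sketch (nothing like a real model's API, just the dependency structure):

```python
# Toy autoregressive decode loop: each step reads every kv entry
# written by earlier steps, which is why decode is inherently serial.
def decode(model_step, n_new_tokens, first_token):
    kv_cache = []            # one entry appended per generated token
    token, out = first_token, []
    for _ in range(n_new_tokens):
        # this step depends on ALL kv entries from previous steps
        token, kv = model_step(token, kv_cache)
        kv_cache.append(kv)  # the next step's input only exists now
        out.append(token)
    return out

# trivial stand-in "model": next token = current token + 1
toy_step = lambda tok, cache: (tok + 1, tok)
print(decode(toy_step, 3, 0))  # → [1, 2, 3]
```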
And that's called layer parallelism (as opposed to tensor parallelism). It allows you to run larger models (pooling VRAM across GPUs) but does not allow you to run models faster.
Tensor parallelism DOES allow you to run models faster across multiple GPUs, but you're limited by how fast you can synchronize the all-reduce. And in general, models would get the same boost on the same hardware, so the Chinese models would have the same perf multiplier as Opus.
Note that providers generally use tensor parallelism as much as they can, for all models. That usually means 8-way or so.
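Back-of-envelope sketch of why tensor-parallel speedup saturates (all numbers here are made up for illustration, and the two-all-reduces-per-layer count is the usual textbook split, attention + MLP output):

```python
# Sharded compute shrinks with GPU count, but each layer still pays a
# roughly fixed all-reduce synchronization cost per token.
def decode_latency_ms(n_gpus, n_layers=80,
                      compute_per_layer_ms=0.5,
                      allreduce_ms=0.05):
    sync = 2 * allreduce_ms if n_gpus > 1 else 0.0
    per_layer = compute_per_layer_ms / n_gpus + sync
    return n_layers * per_layer

for g in (1, 2, 4, 8):
    print(g, round(decode_latency_ms(g), 1))
# 1→40.0 ms, 2→28.0, 4→18.0, 8→13.0: ~3x from 8 GPUs, not 8x
```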
In reality, tps ends up being a pretty good proxy for active param size when comparing different models at the same inference provider.
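The rough reasoning behind that proxy: batch-1 decode is memory-bandwidth bound, so every token has to stream the active weights through HBM once. A sketch with assumed numbers (~3.35 TB/s is approximately H100 SXM HBM bandwidth; 1 byte/param assumes fp8 weights):

```python
# tps ≈ memory bandwidth / active weight bytes, so at a fixed provider
# (fixed hardware and precision) tps tracks active param count.
def est_tps(active_params_b, bytes_per_param=1, hbm_gb_s=3350):
    weight_gb = active_params_b * bytes_per_param
    return hbm_gb_s / weight_gb

print(round(est_tps(70)))  # ~48 tok/s for a 70B dense model at fp8
print(round(est_tps(37)))  # ~91 tok/s for a 37B-active MoE at fp8
```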
Oh I see. I went and confused total aggregate throughput with per-query throughput there, didn't I.