Hacker News

grayxu · today at 12:00 PM

This is not a valid argument. TPS is essentially QoS and can be adjusted; allocating more GPUs results in higher speed.


Replies

yorwba · today at 12:44 PM

There are sequential dependencies, so you can't just arbitrarily increase speed by parallelizing over more GPUs. Every token depends on all previous tokens, every layer depends on all previous layers. You can arbitrarily slow a model down by using fewer, slower GPUs (or none at all), though.
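The sequential dependency can be sketched in a few lines. This is a hypothetical toy, not any real model: `next_token` stands in for a full forward pass, and the only point is the data dependency between iterations.

```python
def next_token(context):
    # Placeholder "model": the output depends on the entire context so far.
    return (sum(context) * 31 + len(context)) % 1000

def decode(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        # Token i cannot be computed until token i-1 exists, so these
        # iterations must run one after another. Extra GPUs can speed up
        # each call to next_token, but cannot run the loop in parallel.
        tokens.append(next_token(tokens))
    return tokens

out = decode([1, 2, 3], 5)
```

More hardware shrinks the latency of each `next_token` call (up to the limits of intra-layer parallelism), but the loop itself stays serial, which is why per-request TPS can't be raised arbitrarily by adding GPUs.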
