Hacker News

grayxu · today at 12:00 PM

This is not a valid argument. TPS is essentially QoS and can be adjusted; allocating more GPUs results in higher speed.


Replies

yorwba · today at 12:44 PM

There are sequential dependencies, so you can't just arbitrarily increase speed by parallelizing over more GPUs. Every token depends on all previous tokens, every layer depends on all previous layers. You can arbitrarily slow a model down by using fewer, slower GPUs (or none at all), though.
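The sequential dependency can be sketched in a few lines. This is a hypothetical toy, not any real model: `next_token` stands in for a full forward pass, and the only point is the data dependency between iterations.

```python
def next_token(context):
    # Placeholder "model": the output depends on the entire context so far.
    return (sum(context) * 31 + len(context)) % 1000

def decode(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        # Token i cannot be computed until token i-1 exists, so these
        # iterations must run one after another. Extra GPUs can speed up
        # each call to next_token, but cannot run the loop in parallel.
        tokens.append(next_token(tokens))
    return tokens

out = decode([1, 2, 3], 5)
```

More hardware shrinks the latency of each `next_token` call (up to the limits of intra-layer parallelism), but the loop itself stays serial, which is why per-request TPS can't be raised arbitrarily by adding GPUs.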
