logoalt Hacker News

stavroslast Saturday at 10:12 PM2 repliesview on HN

This makes no sense. It's not like they have a "slow it down" knob, they're probably parallelizing your request so you get a 2.5x speedup at 10x the price.


Replies

brookstyesterday at 12:19 AM

All of these systems use massive pools of GPUs, and allocate many requests to each node. The “slow it down” knob is to steer a request to nodes with more concurrent requests; “speed it up” is to route to less-loaded nodes.

show 1 reply
landl0rdyesterday at 4:03 AM

What they are probably doing is speculative decoding, given they've mentioned identical distribution at 2.5x speed. That's roughly in the range you'd expect for that to achieve; 10x is not.

It's also absolute highway robbery (or at least overly-aggressive price discrimination) to charge 6x for speculative decoding, by the way. It is not that expensive and (under certain conditions, usually very cheap drafter and high acceptance rate) actually decrease total cost. In any case, it's unlikely to be even a 2x cost increase, let alone 6x.