This makes no sense. It's not like they have a "slow it down" knob, they're prob...

stavros • last Saturday at 10:12 PM • 2 replies • view on HN

This makes no sense. It's not like they have a "slow it down" knob, they're probably parallelizing your request so you get a 2.5x speedup at 10x the price.

Replies

brookst • yesterday at 12:19 AM

All of these systems use massive pools of GPUs, and allocate many requests to each node. The “slow it down” knob is to steer a request to nodes with more concurrent requests; “speed it up” is to route to less-loaded nodes.

➕ show 1 reply

landl0rd • yesterday at 4:03 AM

What they are probably doing is speculative decoding, given they've mentioned identical distribution at 2.5x speed. That's roughly in the range you'd expect for that to achieve; 10x is not.

It's also absolute highway robbery (or at least overly-aggressive price discrimination) to charge 6x for speculative decoding, by the way. It is not that expensive and (under certain conditions, usually very cheap drafter and high acceptance rate) actually decrease total cost. In any case, it's unlikely to be even a 2x cost increase, let alone 6x.

alt Hacker News

Replies