Hacker News

reissbaker, last Monday at 8:15 PM

A local rig is not free: it requires large capital expenditure while producing very low token throughput for large models. Within any time budget, you can get many orders of magnitude more large-model tokens out of an 8xB200 than out of a local rig, so cloud tokens have a huge capital-efficiency advantage. That will hold basically forever: there will always be large cloud companies willing to spend millions of dollars on more capital-efficient hardware, so Nvidia and friends will continue to spare no expense producing it, and that hardware will stay far too expensive for anyone who isn't a large inference company. You can still buy a local rig, but it will be less capital-efficient per token, not more.
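As a rough sketch of that capital-efficiency gap: the hardware prices, throughput figures, and 3-year lifetime below are all made-up illustrative assumptions, not measured numbers, but they show the shape of the comparison.

```python
# Back-of-envelope tokens-per-dollar comparison. Every number here is
# an illustrative assumption, not a real price or benchmark.
def tokens_per_dollar(hw_cost_usd, tokens_per_sec, lifetime_years=3):
    """Total tokens produced over the hardware's lifetime, per dollar
    of capital expenditure (ignoring power, cooling, and labor)."""
    seconds = lifetime_years * 365 * 24 * 3600
    return tokens_per_sec * seconds / hw_cost_usd

# Hypothetical figures: an 8xB200 node serving a large model with heavy
# batching, vs. a local rig running the same model very slowly.
cloud = tokens_per_dollar(hw_cost_usd=500_000, tokens_per_sec=10_000)
local = tokens_per_dollar(hw_cost_usd=10_000, tokens_per_sec=5)
print(f"cloud: {cloud:,.0f} tokens/$  local: {local:,.0f} tokens/$")
print(f"capital-efficiency advantage: {cloud / local:.0f}x")
```

Even with a local rig that costs 50x less, the batched cloud node comes out far ahead per dollar under these assumptions, because throughput on large models scales much faster than hardware price.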

(And this is a generous argument: it ignores the massive software-stack optimization that cloud companies do and that doesn't trickle down to local-rig-sized deployments. Prefill/decode disaggregation, for example, would double the VRAM requirements of a local rig, if you could even do it on one, which you can't, because local rigs don't have InfiniBand. At scale, though, disaggregation improves capital efficiency, since you can tune the compute-bound prefill nodes differently from the memory-bound decode nodes.)
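To see why tuning the two node types separately can pay off, here is a toy model; the per-dollar resource rates, the 3x specialization factor, and the 50/50 budget split are purely assumed numbers for illustration, not properties of any real hardware.

```python
# Toy capital-efficiency model for prefill/decode disaggregation.
# Assume each request needs 1 unit of compute (prefill is
# compute-bound) and 1 unit of memory bandwidth (decode is
# bandwidth-bound). All rates below are made-up assumptions.
GENERAL_FLOPS_PER_USD = 1.0  # balanced node: a bit of both resources
GENERAL_BW_PER_USD = 1.0
TUNED_FLOPS_PER_USD = 3.0    # assumed: a node tuned purely for one
TUNED_BW_PER_USD = 3.0       # resource gets 3x of it per dollar

# Colocated fleet: every dollar buys a balanced node, so requests per
# dollar are limited by whichever resource binds first.
colocated = min(GENERAL_FLOPS_PER_USD, GENERAL_BW_PER_USD)

# Disaggregated fleet: split the budget between prefill-tuned and
# decode-tuned nodes so that neither stage is the bottleneck.
prefill_share = 0.5
disaggregated = min(prefill_share * TUNED_FLOPS_PER_USD,
                    (1 - prefill_share) * TUNED_BW_PER_USD)

print(f"requests/$: colocated={colocated}, disaggregated={disaggregated}")
```

Under these assumptions the disaggregated fleet serves 1.5x the requests per dollar; the gain comes entirely from specialization, which is exactly the lever a local rig, with its one fixed node, cannot pull.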

The advantage of local rigs is not capital-efficient tokens; it's privacy. But then again, many inference companies offer zero-data-retention options, so for many use cases even that may not matter unless you need strict guarantees that the data never leaves the building...


Replies

zozbot234, last Monday at 10:21 PM

> The local rig is not free and requires very large capital expenditures while producing very low token throughput for large models.

Sometimes it really is free, though: the hardware was bought to serve some other existing need, and that capital expense was fully depreciated quite some time ago. Underutilised hardware is essentially ubiquitous.

> Within any time budget, you can get many orders of magnitude more large-model tokens off an 8xB200 than off a local rig.

But using that 8xB200 setup to run inference on cheap, non-frontier models is a plain waste. Its highest and best use is in an AI datacenter serving exceptionally smart models like Gemini DeepThink, GPT Pro or Claude Mythos. (If that isn't true, it means the current level of large-scale investment in frontier, superintelligent AI is misplaced, and you should worry about that, not about whether some models are best run on lower-end hardware!)