
gpugreg, yesterday at 9:02 AM

Serving a single user is likely not profitable, but total throughput rises sharply when serving many concurrent users: each decoding step has to read the model weights from memory anyway, and the same weights can be used to generate a token for every user in the batch at once, so the cost per token drops.
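To make the batching argument concrete, here is a rough back-of-the-envelope sketch. It assumes a simple roofline-style model in which one decoding step takes as long as the slower of (a) reading the weights from memory once and (b) doing the arithmetic for every sequence in the batch. The function name and all hardware and model numbers are made up for illustration, and the model ignores KV-cache reads, communication, and scheduling overhead.

```python
def decode_throughput(batch_size, weight_bytes, mem_bandwidth,
                      flops_per_token, peak_flops):
    """Estimate tokens/second for one decode step (illustrative model only)."""
    # The weights are read from memory once per step, regardless of batch size.
    t_memory = weight_bytes / mem_bandwidth
    # Arithmetic grows linearly with the number of sequences in the batch.
    t_compute = batch_size * flops_per_token / peak_flops
    # The step is limited by whichever of the two is slower.
    step_time = max(t_memory, t_compute)
    # Each step emits one token per sequence in the batch.
    return batch_size / step_time


# Hypothetical numbers, not taken from any real deployment:
WEIGHT_BYTES = 100e9      # 100 GB of weights
MEM_BANDWIDTH = 2e12      # 2 TB/s memory bandwidth
FLOPS_PER_TOKEN = 2e11    # arithmetic per generated token
PEAK_FLOPS = 1e15         # 1 PFLOP/s peak compute

if __name__ == "__main__":
    for batch in (1, 8, 64, 256, 1024):
        tps = decode_throughput(batch, WEIGHT_BYTES, MEM_BANDWIDTH,
                                FLOPS_PER_TOKEN, PEAK_FLOPS)
        print(f"batch={batch:5d}  tokens/s={tps:10.1f}")
```

With these made-up numbers, throughput grows almost linearly with batch size (about 20 tokens/s at batch 1, about 1,280 at batch 64) until the arithmetic becomes the bottleneck, after which it flattens out. The hardware cost per hour is the same in every row, so the cost per token falls by roughly the same factor.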

Also, a lot of money is being made on input tokens and cached tokens, which are much cheaper to compute than output tokens.

DeepSeek published their math for serving the V3/R1 models. They reported a theoretical cost profit margin of 545%, i.e. revenue of roughly 6.45 times cost: https://github.com/deepseek-ai/open-infra-index/blob/main/20...