Hacker News

r0b05 · yesterday at 5:48 AM

It's an interesting point, but local GPU efficiency is not something I think about when I'm being rate-limited or when my subscription costs keep rising.


Replies

fleventynine · yesterday at 3:57 PM

I think folks in this thread are underestimating how expensive it is to serve a SoTA model at 100 tokens a second. In addition to the $500k in capital costs, you also have significant electricity costs.

This stuff is expensive because supply is much lower than demand. If everyone were to run their own hardware at a batch size of 1, we'd have 100x more demand for inference hardware and electricity than we do now, and people would be even more frustrated. Efficiency is everything, and we need all the economies of scale we can get to meet demand.
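A back-of-envelope sketch of why batch size matters so much. All of the specific numbers here (3-year amortization, 10 kW node power, $0.10/kWh electricity) are my own assumptions, not figures from the thread; only the $500k capital cost and 100 tokens/s come from the comment above:

```python
# Illustrative cost-per-token math under assumed numbers.
CAPEX_USD = 500_000              # capital cost cited above
YEARS = 3                        # assumed amortization period
POWER_KW = 10                    # assumed node power draw
KWH_PRICE_USD = 0.10             # assumed electricity price
TOKENS_PER_SEC_PER_STREAM = 100  # per-request speed cited above

# Hourly cost = amortized capital + electricity.
hourly_cost = CAPEX_USD / (YEARS * 365 * 24) + POWER_KW * KWH_PRICE_USD

def cost_per_million_tokens(batch_size: int) -> float:
    """USD per 1M tokens when batch_size request streams share the hardware."""
    tokens_per_hour = TOKENS_PER_SEC_PER_STREAM * batch_size * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

for bs in (1, 16, 128):
    print(f"batch size {bs:>3}: ${cost_per_million_tokens(bs):,.2f} per 1M tokens")
```

The hardware cost is nearly fixed, so cost per token falls roughly linearly with batch size: a provider batching 128 concurrent requests pays on the order of 1/128th what a single-user box with the same hardware pays. (This ignores the throughput penalty of large batches, so it's an upper bound on the batching win.)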
