Aurornis • today at 4:06 AM • 3 replies

The reason you don't see more of this is that everyone does the math, realizes it's not a good deal, and then gives up on the idea.

There's a post at the top of /r/localllama about this exact math right now: https://www.reddit.com/r/LocalLLaMA/comments/1ubrcwj/tokenom...

TL;DR: Running GLM 5.2 is going to cost about $20K minimum, and that's going to be painfully slow compared to the cloud-hosted versions. Even in the estimates where the server is computing tokens 24/7, you can't break even for several years.

The only reason to run locally is if complete data privacy is your top concern. You pay a high premium for that.

Replies

wongarsu • today at 2:30 PM

If you invest the minimum to run the model, obviously that's more expensive per token than investing the optimum to get the best price/performance tradeoff (which for GLM 5.2 is at least five times that figure).

If you can bring enough load to run the model on close-to-optimal hardware 24/7 with multiple concurrent requests, and have reasonably cheap power and AC, you would break even in a reasonable timespan. That won't happen unless you are self-hosting for a medium-sized company. I guess you could sell your spare capacity to get better utilization ... and we've reinvented hosted inference.
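
For anyone who wants to redo the math with their own numbers, here is a rough sketch of the break-even calculation being argued about. Only the $20K minimum and the "at least five times that" figure come from this thread. The throughput, utilization, cloud price, power draw, and electricity rate are placeholder assumptions, and `breakeven_years` is a hypothetical helper, not anything from the linked post. It also assumes the box draws full power around the clock and ignores cooling, maintenance, and resale value.

```python
def breakeven_years(hardware_cost, tokens_per_second, utilization,
                    cloud_price_per_mtok, power_watts, electricity_per_kwh):
    """Years until avoided cloud spend covers the hardware cost.

    Returns None when electricity alone costs more than the cloud
    bill being replaced, i.e. the hardware never pays for itself.
    """
    seconds_per_year = 365 * 24 * 3600
    hours_per_year = 365 * 24

    # Tokens actually served per year at the given utilization (0.0 to 1.0).
    tokens_per_year = tokens_per_second * utilization * seconds_per_year

    # What those tokens would have cost from a hosted provider.
    cloud_cost_per_year = tokens_per_year / 1e6 * cloud_price_per_mtok

    # Simplification: the machine draws power_watts all year, busy or idle.
    power_cost_per_year = power_watts / 1000 * hours_per_year * electricity_per_kwh

    savings_per_year = cloud_cost_per_year - power_cost_per_year
    if savings_per_year <= 0:
        return None
    return hardware_cost / savings_per_year


# Minimum build from the thread ($20K), single slow stream (assumed numbers).
minimum_build = breakeven_years(20_000, 20, 1.0, 2.0, 1_000, 0.15)

# Roughly 5x that budget, batched concurrent requests (assumed numbers).
bigger_build = breakeven_years(100_000, 2_000, 0.8, 2.0, 5_000, 0.15)

print("minimum build:", minimum_build)
print("bigger build:", bigger_build)
```

With these made-up inputs the outcome hinges almost entirely on aggregate tokens per second and utilization, which is the point about concurrent requests above. Swap in real measurements before drawing any conclusion.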

FridgeSeal • today at 7:40 AM

I mean sure, if you’re attempting to run the biggest possible models, it’s going to require a stupid amount of compute. I thought we all knew this?

The appeal to me is that we can run that, but we can also run smaller models on a laptop _and it’s functional!_ I can run DeepSeek v4 flash and a qwen 3.6 on my laptop! That’s crazy good.

pjc50 • today at 12:14 PM

... conversely, all the cloud LLMs are being subsidized by their investors, in addition to benefiting from massive economies of scale.
