A realistic setup for this would be a 16× H100 80GB node with NVLink. That comfortably holds the full expert weights plus KV cache without extreme quantization (only ~32B parameters are active per token, but every expert still has to be resident for serving). Cost-wise we are looking at roughly $500k–$700k upfront or $40–60/hr on-demand, which makes it clear this model is aimed at serious infra teams, not casual single-GPU deployments. I’m curious how API providers will price tokens on top of that hardware reality.
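For a rough sanity check on the memory claim, here is a back-of-envelope sketch; the ~1T total parameter count is an assumption for illustration, not an official spec:

```python
# Back-of-envelope VRAM check for serving a large MoE on 16x H100 80GB.
# The ~1T total parameter count is an assumption, not an official spec;
# all experts must be resident even though only ~32B are active per token.

GIB = 1024**3
total_params = 1.0e12
vram_gib = 16 * 80  # 1280 GiB aggregate across the node

for precision, bytes_per_param in [("bf16", 2.0), ("fp8", 1.0)]:
    weights_gib = total_params * bytes_per_param / GIB
    headroom = vram_gib - weights_gib  # left over for KV cache + activations
    print(f"{precision}: weights ~{weights_gib:,.0f} GiB, headroom ~{headroom:,.0f} GiB")
```

At bf16 the weights alone (~1.9 TiB) overflow the node, so "without extreme quantization" in practice means fp8, which leaves roughly 350 GiB for KV cache.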
You don't need to wait and see; Kimi K2 has the same hardware requirements and already has several providers on OpenRouter:
https://openrouter.ai/moonshotai/kimi-k2-thinking
https://openrouter.ai/moonshotai/kimi-k2-0905
https://openrouter.ai/moonshotai/kimi-k2-0905:exacto
https://openrouter.ai/moonshotai/kimi-k2
Generally it seems to be in the neighborhood of $0.50/1M tokens for input and $2.50/1M tokens for output.
Generally speaking, 8xH200s will be a lot cheaper than 16xH100s, and faster too. But both should technically work.
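Those prices only make sense with heavy batching; here is a rough break-even sketch, where the per-GPU hourly rental rate is an assumption:

```python
# Break-even throughput for an 8x H200 node at ~$2.50/1M output tokens.
# The ~$3/hr per-GPU rental rate is an assumption for illustration.

node_cost_per_hr = 8 * 3.00   # dollars/hr for the whole node
price_per_token = 2.50 / 1e6  # dollars per output token

tokens_per_sec = node_cost_per_hr / price_per_token / 3600
print(f"break-even: ~{tokens_per_sec:,.0f} output tokens/sec, node-wide")
# ~2,700 tok/s aggregate is achievable with large batch sizes, which is
# how providers can price this low while a single-tenant box cannot.
```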
The other realistic setup is around $20k: two Mac Studios connected over Thunderbolt 5 RDMA, for a small company that needs a private AI for coding or other internal agentic use.
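For reference, the arithmetic behind that figure, assuming the current top-memory Mac Studio configuration:

```python
# Cost and memory for the two-Mac-Studio setup. The price and memory
# config are assumptions based on the current 512 GB Mac Studio option.

mac_price = 9_500   # assumed ~$9.5k per 512 GB Mac Studio
unified_gb = 512

total_cost = 2 * mac_price
total_mem_gb = 2 * unified_gb  # pooled over Thunderbolt 5 RDMA
print(f"${total_cost:,} buys {total_mem_gb} GB of unified memory")
```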
The weights are int4, so you'd only need 8xH100s.
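The arithmetic checks out under the same assumed ~1T parameter count:

```python
# Why int4 weights fit on 8x H100 80GB. The ~1T total parameter
# count is an assumption for illustration, not an official spec.

GIB = 1024**3
total_params = 1.0e12
weights_gib = total_params * 0.5 / GIB  # int4 = 0.5 bytes per parameter
vram_gib = 8 * 80                       # 640 GiB aggregate

print(f"int4 weights ~{weights_gib:,.0f} GiB vs {vram_gib} GiB VRAM, "
      f"~{vram_gib - weights_gib:,.0f} GiB left for KV cache")
```

That puts the weights at roughly 466 GiB, leaving about 174 GiB for KV cache, which is tight for long contexts but workable.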