A realistic setup for this would be a 16× H100 80GB node with NVLink. That comfortably holds the full expert weights plus KV cache without extreme quantization (only ~32B parameters are active per token, but every expert still has to be resident for serving). Cost-wise we are looking at roughly $500k–$700k upfront or $40–60/hr on-demand, which makes it clear this model is aimed at serious infra teams, not casual single-GPU deployments. I’m curious how API providers will price tokens on top of that hardware reality.
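For a rough sanity check on the memory claim, here is a back-of-envelope sketch; the ~1T total parameter count is an assumption for illustration, not an official spec:

```python
# Back-of-envelope VRAM check for serving a large MoE on 16x H100 80GB.
# The ~1T total parameter count is an assumption, not an official spec;
# all experts must be resident even though only ~32B are active per token.

GIB = 1024**3
total_params = 1.0e12
vram_gib = 16 * 80  # 1280 GiB aggregate across the node

for precision, bytes_per_param in [("bf16", 2.0), ("fp8", 1.0)]:
    weights_gib = total_params * bytes_per_param / GIB
    headroom = vram_gib - weights_gib  # left over for KV cache + activations
    print(f"{precision}: weights ~{weights_gib:,.0f} GiB, headroom ~{headroom:,.0f} GiB")
```

At bf16 the weights alone (~1.9 TiB) overflow the node, so "without extreme quantization" in practice means fp8, which leaves roughly 350 GiB for KV cache.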
You don't need to wait and see; Kimi K2 has the same hardware requirements and already has several providers on OpenRouter:
https://openrouter.ai/moonshotai/kimi-k2-thinking
https://openrouter.ai/moonshotai/kimi-k2-0905
https://openrouter.ai/moonshotai/kimi-k2-0905:exacto
https://openrouter.ai/moonshotai/kimi-k2
Generally it seems to be in the neighborhood of $0.50/1M tokens for input and $2.50/1M tokens for output.
Generally speaking, 8xH200s will be a lot cheaper than 16xH100s, and faster too. But both should technically work.
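Those prices only make sense with heavy batching; here is a rough break-even sketch, where the per-GPU hourly rental rate is an assumption:

```python
# Break-even throughput for an 8x H200 node at ~$2.50/1M output tokens.
# The ~$3/hr per-GPU rental rate is an assumption for illustration.

node_cost_per_hr = 8 * 3.00   # dollars/hr for the whole node
price_per_token = 2.50 / 1e6  # dollars per output token

tokens_per_sec = node_cost_per_hr / price_per_token / 3600
print(f"break-even: ~{tokens_per_sec:,.0f} output tokens/sec, node-wide")
# ~2,700 tok/s aggregate is achievable with large batch sizes, which is
# how providers can price this low while a single-tenant box cannot.
```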
The other realistic setup is around $20k: two Mac Studios connected over Thunderbolt 5 RDMA, for a small company that needs a private AI for coding or other internal agentic use.
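For reference, the arithmetic behind that figure, assuming the current top-memory Mac Studio configuration:

```python
# Cost and memory for the two-Mac-Studio setup. The price and memory
# config are assumptions based on the current 512 GB Mac Studio option.

mac_price = 9_500   # assumed ~$9.5k per 512 GB Mac Studio
unified_gb = 512

total_cost = 2 * mac_price
total_mem_gb = 2 * unified_gb  # pooled over Thunderbolt 5 RDMA
print(f"${total_cost:,} buys {total_mem_gb} GB of unified memory")
```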
The weights are int4, so you'd only need 8xH100s.
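The arithmetic checks out under the same assumed ~1T parameter count:

```python
# Why int4 weights fit on 8x H100 80GB. The ~1T total parameter
# count is an assumption for illustration, not an official spec.

GIB = 1024**3
total_params = 1.0e12
weights_gib = total_params * 0.5 / GIB  # int4 = 0.5 bytes per parameter
vram_gib = 8 * 80                       # 640 GiB aggregate

print(f"int4 weights ~{weights_gib:,.0f} GiB vs {vram_gib} GiB VRAM, "
      f"~{vram_gib - weights_gib:,.0f} GiB left for KV cache")
```

That puts the weights at roughly 466 GiB, leaving about 174 GiB for KV cache, which is tight for long contexts but workable.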