Hacker News

martinald today at 12:00 AM

Yes, 32B dense is a weird one to choose.

But in reality, 32B dense is very similar* to an MoE with 32B active params in terms of inference cost. And I highly suspect e.g. Opus is around that level of active params.

A 284B-A13B model (284B total params, 13B active) is, at scale, almost certainly cheaper to serve than a 32B dense model.

*As you can shard the model across multiple GPUs at scale. But in reality you lose some efficiency to GPU coordination and expert routing.
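To put rough numbers on that comparison, here is a napkin-math sketch. It assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token, and 1 byte per parameter for weights (an FP8-style assumption, not something stated above). It ignores KV cache, attention cost, batching, and the coordination and routing overhead mentioned in the footnote, so treat the ratios as ballpark only. The helper names are made up for illustration.

```python
# Napkin math: 32B dense vs. a 284B-total / 13B-active MoE.
# Assumptions (not from the thread): ~2 FLOPs per active parameter per token,
# and 1 byte per parameter for stored weights.

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs to generate one token."""
    return 2.0 * active_params

def weight_memory_gb(total_params: float, bytes_per_param: float = 1.0) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return total_params * bytes_per_param / 1e9

dense = {"total": 32e9, "active": 32e9}
moe = {"total": 284e9, "active": 13e9}

# Compute per token scales with ACTIVE params.
compute_ratio = flops_per_token(moe["active"]) / flops_per_token(dense["active"])

# Weight memory scales with TOTAL params.
memory_ratio = weight_memory_gb(moe["total"]) / weight_memory_gb(dense["total"])

print(f"MoE compute per token vs dense: {compute_ratio:.2f}x")
print(f"MoE weight memory vs dense:     {memory_ratio:.2f}x")
```

Under these assumptions the MoE needs roughly 0.4x the compute per token but close to 9x the weight memory. That is why the comparison depends on serving at scale: the extra memory has to be spread across several GPUs and amortized over many concurrent requests.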


Replies

breput today at 12:46 AM

That's good information. I couldn't even begin to run DeepSeek Flash on my system, but if you're assuming multiple GPUs, that is going to affect the napkin math.
