Hacker News

zozbot234, last Friday at 6:49 PM

As a typical example, DeepSeek v4-pro has 59B active params at mostly FP4 size, so it needs around 30GB worth of params to be available in RAM for each inferred token. On a machine with 512GB of total RAM, most of those params will already be cached in RAM (model size on disk is around 862GB), so assuming for the sake of argument that MoE expert selection is completely random and unpredictable, around 15GB in total has to be fetched from storage per token. If MoE selection is not completely random and there's enough locality, that figure improves quite a bit and inference becomes quite workable.
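For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch in Python. The 59B active params, FP4 (about 0.5 bytes per param), and 862GB on-disk size come from the comment above. The amount of RAM actually usable as weight cache (`cache_gb`) and the storage bandwidth (`storage_gb_per_s`) are assumptions picked for illustration, not measurements. It treats every byte of the model as equally likely to be needed, which is the "completely random" worst case.

```python
def fetched_gb_per_token(active_params_b, bytes_per_param, model_gb, cache_gb):
    """Expected GB read from storage per token, assuming uniformly random
    expert selection, so the RAM hit rate is simply cache_gb / model_gb."""
    active_gb = active_params_b * bytes_per_param  # billions of params * bytes/param = GB
    hit_fraction = min(1.0, cache_gb / model_gb)
    return active_gb * (1.0 - hit_fraction)


def tokens_per_second(fetched_gb, storage_gb_per_s):
    """Upper bound on decode speed if storage reads are the only bottleneck."""
    if fetched_gb <= 0:
        return float("inf")
    return storage_gb_per_s / fetched_gb


# Figures from the comment: 59B active params, ~0.5 bytes each, 862GB model.
# ASSUMPTION: all 512GB of RAM is usable as cache (optimistic).
best_case = fetched_gb_per_token(59, 0.5, 862, 512)

# ASSUMPTION: only ~430GB is usable once the OS, KV cache, etc. take their share.
realistic = fetched_gb_per_token(59, 0.5, 862, 430)

# ASSUMPTION: 7 GB/s sequential read, a hypothetical fast NVMe drive.
print(f"best case: {best_case:.1f} GB/token")
print(f"realistic: {realistic:.1f} GB/token")
print(f"tokens/s bound: {tokens_per_second(realistic, 7.0):.2f}")
```

With all 512GB counted as cache the estimate is about 12GB per token. It only reaches the "around 15GB" quoted above if a chunk of RAM is reserved for other things. Either way, the random-selection case stays well under one token per second on a single drive at the assumed bandwidth, so locality is what carries the argument.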


Replies

SwellJoeyesterday at 1:21 AM

I've never seen reports of this kind of setup being able to deliver more than low single-digit tokens per second. That's certainly not usable interactively, and only of limited utility for "leave it to think overnight" tasks. Am I missing something?

Also, I don't know of a general solution to streaming models from disk. Is there an inference engine that has this built in, in a way that is generally applicable to any model? I know (I mean, I've seen people say it, I haven't tried it) you can use swap memory with CPU offloading in llama.cpp, and I can imagine that would probably work...but definitely slowly. I don't know if it automatically handles putting the most important routing layers on the GPU before offloading other stuff to system RAM/swap, though. I know system RAM would, over time, come to hold the hottest selection of layers most of the time, as that's how swap works. Some people seem to be manually splitting up the layers and distributing them across GPU and system RAM.
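The intuition that RAM ends up holding the hottest experts can be poked at with a toy simulation. This is only a sketch under heavy assumptions: it models RAM as a plain LRU cache of whole experts, picks one expert per token (real MoE models pick several per layer), and uses a made-up 1/rank popularity curve for the "skewed" case. It is not how any particular inference engine or kernel page cache actually behaves.

```python
import random
from collections import OrderedDict


def lru_hit_rate(weights, cache_slots, n_tokens=20000, seed=0):
    """Fraction of expert lookups served from an LRU cache holding
    `cache_slots` experts, with experts drawn according to `weights`."""
    rng = random.Random(seed)
    experts = list(range(len(weights)))
    cache = OrderedDict()  # keys are expert ids, ordered oldest to newest use
    hits = 0
    for expert in rng.choices(experts, weights=weights, k=n_tokens):
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # mark as most recently used
        else:
            cache[expert] = True           # "fetch from storage"
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict least recently used
    return hits / n_tokens


n_experts = 100
slots = 60  # ASSUMPTION: roughly 60% of the model fits in RAM

uniform = [1.0] * n_experts                            # completely random selection
skewed = [1.0 / (r + 1) for r in range(n_experts)]     # hypothetical 1/rank popularity

print(f"uniform hit rate: {lru_hit_rate(uniform, slots):.2f}")
print(f"skewed hit rate:  {lru_hit_rate(skewed, slots):.2f}")
```

With uniform selection the hit rate sits near the fraction of the model that fits in cache. With skewed selection it comes out noticeably higher, so how much locality real expert routing has is the thing that decides whether this is usable.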

Have you actually done this? On what hardware? With what inference engine?