What is everyone running DeepSeek v4 Flash with?!
It’s currently unsupported on llama.cpp, and vLLM doesn’t support GPU+CPU MoE offload, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?!
you can run it today with mlx if you have a 256GB or 512GB mac studio. no "antirez" fork needed.
it isn't that large a model, and the compressed KV implementation is not that complicated
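to make "compressed KV" concrete: a minimal numpy sketch of the general idea of a low-rank latent KV cache. all the dimensions and projection names here are made up for illustration, this is not the model's actual config or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes -- the real model's dimensions are not stated in the thread.
d_model, d_latent, n_heads, d_head = 64, 16, 4, 16

# Down-projection compresses a token's hidden state into one small latent;
# only this latent is cached, instead of full per-head K and V tensors.
W_down = rng.standard_normal((d_model, d_latent)) * 0.1
# Up-projections rebuild per-head K and V from the cached latent at attention time.
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1

def cache_token(h):
    """Only this small latent goes into the KV cache."""
    return h @ W_down                                  # shape: (d_latent,)

def expand(latent):
    """Reconstruct full per-head K and V from a cached latent."""
    k = (latent @ W_up_k).reshape(n_heads, d_head)
    v = (latent @ W_up_v).reshape(n_heads, d_head)
    return k, v

h = rng.standard_normal(d_model)
latent = cache_token(h)
k, v = expand(latent)

# The cache stores d_latent floats per token instead of 2 * n_heads * d_head.
print(latent.shape, k.shape, v.shape)   # (16,) (4, 16) (4, 16)
print(2 * n_heads * d_head / d_latent)  # 8.0x smaller cache in this toy setup
```

the recompute cost at attention time is two small matmuls per token, which is why the cache savings come almost for free.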
the problem is that they released the model in a quantized format that is more complex than it appears, and people make a lot of mistakes working with it. it is quantization-aware trained, so you can't "just" upcast it to higher precision and requantize back down.
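a toy numpy demo of why the upcast-then-requantize round trip hurts: quantizing already-quantized weights can only move grid points further from the true values. the bit-widths and random weights here are stand-ins, not the model's actual format.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in weights, not real ones

def quantize(x, bits, scale):
    """Symmetric uniform quantization to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax)

def dequantize(q, scale):
    return q * scale

# "Released" weights: already on a 4-bit grid (stand-in for the QAT checkpoint).
s4 = np.abs(w).max() / 7          # 4-bit scale, codes in [-7, 7]
w4 = dequantize(quantize(w, 4, s4), s4)

# Requantize to 3 bits, once from the true weights and once from the
# already-quantized ones -- the naive "upcast and scale down" pipeline.
s3 = np.abs(w).max() / 3          # 3-bit scale, codes in [-3, 3]
err_direct = np.mean((w - dequantize(quantize(w, 3, s3), s3)) ** 2)
err_chained = np.mean((w - dequantize(quantize(w4, 3, s3), s3)) ** 2)

# The chained error is never smaller: rounding an already-rounded value
# can only land on a grid point at least as far from the original.
print(err_direct, err_chained)
```

with QAT the released codes are what the network was trained against, so any round trip that doesn't reproduce them exactly is silent quality loss on top of this.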
vllm runs dsv4 flash fine right now
dgx sparks can't really run it correctly right now with the released vllm, but there are PRs, so it's just a matter of time. you would need 3 of them, and they would still be only about half as fast as the mac studio.
so the punchline is: this is why the 512GB mac studio is such a hot commodity right now.
https://www.github.com/antirez/ds4 (from Antirez of Redis fame) runs a 2-bit quant on Apple Silicon hardware with 96GB or 128GB of RAM.
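for anyone wondering why 2-bit fits in 96GB: here's the back-of-envelope math. the parameter counts and the 10% overhead factor are my guesses for illustration, the thread never states the model's size.

```python
def quant_footprint_gb(n_params, bits_per_weight, overhead=1.10):
    """Rough weight-memory estimate: params * bits / 8 bytes, plus ~10%
    slack for scales, higher-precision embeddings, and the KV cache.
    The overhead factor is a guess, not a measurement."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

# Hypothetical parameter counts -- not the model's actual size.
for n in (100e9, 300e9, 400e9):
    print(f"{n / 1e9:.0f}B params @ 2-bit: ~{quant_footprint_gb(n, 2):.1f} GB")
```

so even a ~300B-param model at 2 bits lands well under 96GB of weights, leaving room for activations and context.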