What's the advantage of ds4 over llama.cpp, especially if his forked kernels get upstreamed down the line?
Currently, llama.cpp clusters don't support tensor parallelism. Have a look at Donato Capitella's detailed report (https://m.youtube.com/watch?v=PkKXm_mKCCM). He also provides ROCm toolboxes for Strix Halo: https://strix-halo-toolboxes.com/#about
I think the main advantage is that he can move much faster with specific improvements targeting DeepSeek on systems with unified memory (Mac or Strix). It's a lot easier to optimize if you don't need to worry about all the other architectures. So optimize he did, and it's just a lot faster than llama.cpp for DeepSeek V4 Pro and Flash. Interesting features are also more doable, like SSD streaming, which makes it possible to run an MoE model whose weights are larger than your VRAM. I don't see that landing in llama.cpp anytime soon.
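To make the SSD streaming idea concrete, here's a rough sketch of the general technique, not ds4's actual code: keep the expert weights in a file, memory-map it, and only touch the experts the router picked for the current token, so the OS pages in just those slices. The file layout, sizes, and function names (`write_toy_weights`, `load_expert`, `run_token`) are all made up for illustration.

```python
import mmap
import os
import struct
import tempfile

NUM_EXPERTS = 8                    # toy value, real MoE models have many more
EXPERT_FLOATS = 4                  # toy expert size (float32 values per expert)
EXPERT_BYTES = EXPERT_FLOATS * 4   # 4 bytes per float32


def write_toy_weights(path):
    """Write a fake weight file: expert e is filled with the value float(e)."""
    with open(path, "wb") as f:
        for e in range(NUM_EXPERTS):
            f.write(struct.pack(f"<{EXPERT_FLOATS}f", *([float(e)] * EXPERT_FLOATS)))


def load_expert(mm, expert_id):
    """Read one expert's weights from the mapping; only these bytes get paged in."""
    return struct.unpack_from(f"<{EXPERT_FLOATS}f", mm, expert_id * EXPERT_BYTES)


def run_token(mm, routed_experts, x):
    """Sum of dot products between x and each routed expert's weights."""
    out = 0.0
    for e in routed_experts:
        w = load_expert(mm, e)
        out += sum(wi * xi for wi, xi in zip(w, x))
    return out


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_toy_weights(path)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                # Pretend the router selected experts 2 and 5 for this token.
                return run_token(mm, [2, 5], [1.0, 1.0, 1.0, 1.0])
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

A real implementation would presumably also need caching of hot experts and prefetching to hide SSD latency, which this toy version ignores.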
IIRC llama.cpp doesn't implement DSv4's compressed attention mechanism, and while ds4 does use (credited) parts of llama.cpp, it's focused on this one great model for now. Much of this is covered better in the repo's readme.