Btw, a few data points:
1. DS4F can run on a 128GB MacBook. M2.7 is larger (with 8-bit weights for the routed experts). It remains to be seen how it holds up at 4 bits; at 2 bits it may not work well at all.
2. Just the KV cache of M2.7 would take ~50GB for 200k tokens AFAIK. It does not have the compressed KV cache that DS4F features.
3. The models are very similar in performance, despite all that. And DS4F is likely getting an update soon.
So it is basically a quasi-frontier model that can run on a 96/128GB MacBook at large context windows. That's non-trivial. A coding version may well be released in the future.
>1. DS4F can run on a 128GB MacBook. M2.7 is larger (with 8-bit weights for the routed experts). It remains to be seen how it holds up at 4 bits; at 2 bits it may not work well at all.
M2.7 is smaller than DS4, 230B total params vs 284B total params, so at any given quantization level it needs ~19% less memory for its weights than DS4F. Both can be quantized to arbitrary precision levels. Larger models like these quantize much better at low precision than smaller models do. There is still loss, but the usability degradation is less catastrophic than for, say, 27B or 14B or 8B models. Again, n=1, but M2.7 holds up phenomenally well for me with unsloth's IQ2_XXS UD.
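If it helps to see the arithmetic behind that ~19%, here's some napkin math (purely illustrative; it assumes one uniform bits-per-weight value, which mixed-precision quants like the UD ones don't actually use, and it ignores KV cache and runtime overhead):

```python
# Back-of-the-envelope weight memory, assuming a single uniform bits-per-weight
# figure for all parameters. Real mixed-precision quants (e.g. unsloth's UD
# dynamic quants) vary precision per tensor, so these are ballpark numbers only.

def weight_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bpw in (8.0, 4.0, 2.06):  # 2.06 ~ nominal IQ2_XXS average bits per weight
    m27 = weight_gb(230, bpw)  # M2.7: 230B total params
    ds4 = weight_gb(284, bpw)  # DS4F: 284B total params
    print(f"{bpw:>5} bpw: M2.7 ~{m27:5.0f} GB, DS4F ~{ds4:5.0f} GB, "
          f"M2.7 needs {100 * (1 - m27 / ds4):.0f}% less")
```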
>2. Just the KV cache of M2.7 would take ~50GB for 200k tokens AFAIK. It does not have the compressed KV cache that DS4F features.
The KV cache can also be quantized. At Q8_0, this is essentially lossless. I can fit a 400k context window with Q8_0 KV cache quantization, unsloth's IQ2_XXS UD weight quantization, and my running OS on a machine with just 128 GB of unified memory. Strix Halo, not Apple Silicon. There are more exotic approaches to KV cache quantization with much higher efficiency, like TurboQuant, but that's beside the point.
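For concreteness, here's a hedged sketch of the standard KV-cache size formula. I don't have M2.7's actual layer/head configuration in front of me, so the config below is a hypothetical placeholder picked only to land near the ~50 GB / 200k-token fp16 figure you quoted; Q8_0 is modeled as ~8.5 bits per element (1 byte plus block-scale overhead):

```python
# Generic KV-cache size for a standard GQA attention stack: two tensors (K and V)
# per layer, each of shape [n_tokens, n_kv_heads, head_dim].
# The layer/head numbers below are HYPOTHETICAL placeholders, not M2.7's real config.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_tokens: int, bytes_per_elem: float) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem / 1e9

cfg = dict(n_layers=62, n_kv_heads=8, head_dim=128)  # hypothetical config

print(kv_cache_gb(**cfg, n_tokens=200_000, bytes_per_elem=2.0))     # fp16:           ~50.8 GB
print(kv_cache_gb(**cfg, n_tokens=200_000, bytes_per_elem=1.0625))  # Q8_0 (8.5 bpw): ~27 GB
print(kv_cache_gb(**cfg, n_tokens=400_000, bytes_per_elem=1.0625))  # Q8_0 at 400k:   ~54 GB
```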
>3. The models are very similar in performance, despite all that. And DS4F is likely getting an update soon.
Yes, though it's worth noting that DS4F requires about 23% more total memory for weights at any given quantization level (284B vs 230B total params), has to shuffle about 30% more data through the pipeline on every forward pass (A13B vs A10B active params), has much higher hallucination rates per AA, and hasn't been fully post-trained. DS4 isn't a base model; it has been instruct-trained, tool-trained, etc., but a lot of capability has been left on the table in the current checkpoints, which are what's actually available now.
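The active-parameter gap translates fairly directly into memory traffic per generated token. Same kind of napkin math as above (illustrative only; it counts weight reads and ignores attention and KV-cache traffic, so treat it as a rough lower bound):

```python
# Rough per-token data movement from weight/expert reads alone: active params
# times bytes per weight. Ignores attention, KV-cache reads, and activations.

def gb_moved_per_token(active_params_billion: float, bits_per_weight: float) -> float:
    return active_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bpw in (8.0, 4.0):
    ds4f = gb_moved_per_token(13, bpw)  # DS4F: ~13B active params (A13B)
    m27 = gb_moved_per_token(10, bpw)   # M2.7: ~10B active params (A10B)
    print(f"{bpw} bpw: DS4F ~{ds4f:.1f} GB/token vs M2.7 ~{m27:.1f} GB/token "
          f"({100 * (ds4f / m27 - 1):.0f}% more)")
```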
>So it is basically a quasi-frontier model that can run on a 96/128GB MacBook at large context windows. That's non-trivial. A coding version may well be released in the future.
MiniMax M2.7 fits into this same box: a quasi-frontier model that can run on 96/128GB unified-memory platforms with a large context window. You're right that it's non-trivial. My preference comes in part from the fact that M2.7 is already coding-focused, and it had been out for almost two months before DS4F showed up.
By the way, in spite of my preference for M2.7 over DS4F (and for Vulkan over ROCm on my hardware), I'm a big fan of your work on DarkStar 4. I admire what you've achieved with the project, how much work you've put into it, and your willingness to share that with the world, too. Thank you for your contributions to the open LLM ecosystem.