Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix.
Deepseek R1 also has an MTP layer (layer 61): https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...
But Deepseek R1 adds embed_tokens and shared_head.head tensors for it, each of shape [129280, 7168], together about 2GB at FP8.
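Quick sanity check on that figure (assuming 1 byte per parameter at FP8):

```python
# Size of DeepSeek R1's extra MTP tensors (shapes from the config above).
vocab_size, hidden_size = 129280, 7168
params_per_tensor = vocab_size * hidden_size   # ~0.93B parameters each
bytes_fp8 = 1                                  # FP8 = 1 byte per parameter

per_tensor_gb = params_per_tensor * bytes_fp8 / 1e9
total_gb = 2 * per_tensor_gb                   # embed_tokens + shared_head.head

print(f"{per_tensor_gb:.2f} GB per tensor, {total_gb:.2f} GB combined")
# -> 0.93 GB per tensor, 1.85 GB combined (~2 GB)
```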
Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...
So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that helps significantly speed up inference.
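To illustrate what "no extra un-embedding matrix" means, here is a minimal PyTorch sketch of a weight-tied MTP head. This is not Qwen3-Next's or DeepSeek's actual code; the `fuse` layer and the combine-hidden-state-with-next-token recipe are just the standard DeepSeek-paper-style MTP shape, and all names are made up. The point is that the head embeds tokens and produces logits through the one shared embedding matrix, so the only new weights are small:

```python
import torch
import torch.nn as nn

class TiedMTPHead(nn.Module):
    """Toy MTP head predicting one extra token ahead. It reuses the base
    model's embedding matrix both to embed tokens and (transposed) as the
    output projection, so no new [vocab, hidden] un-embedding tensor."""

    def __init__(self, base_embedding: nn.Embedding, hidden: int):
        super().__init__()
        self.embed = base_embedding                 # shared with the base model
        self.fuse = nn.Linear(2 * hidden, hidden)   # the only sizable new weight

    def forward(self, hidden_states, next_token_ids):
        # Mix the backbone's hidden state with the embedding of the token
        # one position ahead (DeepSeek-paper-style MTP recipe).
        h = self.fuse(torch.cat([hidden_states, self.embed(next_token_ids)], dim=-1))
        # Weight tying: logits via the shared embedding matrix, no new head.
        return h @ self.embed.weight.T              # [batch, seq, vocab]

# Toy sizes; the real matrices would be e.g. [129280, 7168] per the thread.
vocab, hidden = 1000, 64
base_embed = nn.Embedding(vocab, hidden)
head = TiedMTPHead(base_embed, hidden)
logits = head(torch.randn(2, 5, hidden), torch.randint(0, vocab, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```

A separate DeepSeek-style head would instead instantiate its own `nn.Embedding` and output `nn.Linear`, which is exactly the ~2GB of extra FP8 tensors counted above.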
How is MTP different from Medusa heads? Also, does this mean the model comes with speculative decoding "natively"? That is, if I use this model in vLLM, should its throughput be higher, since it is already doing MTP and should be able to take advantage of speculative decoding?
Could someone kindly point to a convenient all-in-one ELI5 of all these words? :')
What kind of benefit does Multi-Token Prediction bring on the inference side? Or is it only relevant to pretraining efficiency?