Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix.
Deepseek R1 also has an MTP layer (layer 61): https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...
But Deepseek R1 adds embed_tokens and shared_head.head tensors for it, each of shape [129280, 7168], together about 2GB at FP8.
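Quick sanity check on that figure (assuming 1 byte per parameter at FP8):

```python
# Size of DeepSeek R1's extra MTP tensors (shapes from the config above).
vocab_size, hidden_size = 129280, 7168
params_per_tensor = vocab_size * hidden_size   # ~0.93B parameters each
bytes_fp8 = 1                                  # FP8 = 1 byte per parameter

per_tensor_gb = params_per_tensor * bytes_fp8 / 1e9
total_gb = 2 * per_tensor_gb                   # embed_tokens + shared_head.head

print(f"{per_tensor_gb:.2f} GB per tensor, {total_gb:.2f} GB combined")
# -> 0.93 GB per tensor, 1.85 GB combined (~2 GB)
```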
Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...
So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that helps significantly speed up inference.
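To illustrate what "no extra un-embedding matrix" means, here is a minimal PyTorch sketch of a weight-tied MTP head. This is not Qwen3-Next's or DeepSeek's actual code; the `fuse` layer and the combine-hidden-state-with-next-token recipe are just the standard DeepSeek-paper-style MTP shape, and all names are made up. The point is that the head embeds tokens and produces logits through the one shared embedding matrix, so the only new weights are small:

```python
import torch
import torch.nn as nn

class TiedMTPHead(nn.Module):
    """Toy MTP head predicting one extra token ahead. It reuses the base
    model's embedding matrix both to embed tokens and (transposed) as the
    output projection, so no new [vocab, hidden] un-embedding tensor."""

    def __init__(self, base_embedding: nn.Embedding, hidden: int):
        super().__init__()
        self.embed = base_embedding                 # shared with the base model
        self.fuse = nn.Linear(2 * hidden, hidden)   # the only sizable new weight

    def forward(self, hidden_states, next_token_ids):
        # Mix the backbone's hidden state with the embedding of the token
        # one position ahead (DeepSeek-paper-style MTP recipe).
        h = self.fuse(torch.cat([hidden_states, self.embed(next_token_ids)], dim=-1))
        # Weight tying: logits via the shared embedding matrix, no new head.
        return h @ self.embed.weight.T              # [batch, seq, vocab]

# Toy sizes; the real matrices would be e.g. [129280, 7168] per the thread.
vocab, hidden = 1000, 64
base_embed = nn.Embedding(vocab, hidden)
head = TiedMTPHead(base_embed, hidden)
logits = head(torch.randn(2, 5, hidden), torch.randint(0, vocab, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```

A separate DeepSeek-style head would instead instantiate its own `nn.Embedding` and output `nn.Linear`, which is exactly the ~2GB of extra FP8 tensors counted above.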
How is MTP different from Medusa heads? Also, does this mean the model comes with speculative decoding "natively"? That is, if I use this model in vLLM, should its throughput be higher, since it is already doing MTP and should be able to take advantage of speculative decoding?
Could someone kindly point to a convenient all-in-one ELI5 of all these words? :')
What kind of benefit does Multi-Token Prediction bring on the inference side? Or is it only relevant to pretraining efficiency?