What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?
> What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?
It is only useful for inference and doesn't help with pretraining. That actually points to speculative decoding not being sufficiently general: the same underlying property (some sequences of tokens are easy to predict) could be exploited during training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...
It could be a better draft model than separately trained drafters like EAGLE for speculative decoding.
Speculative decoding! It makes inference a LOT faster.
Instead of generating tokens one at a time, the model predicts the second token in the same forward pass, and that prediction serves as the draft for speculative decoding (instead of coming from a separate draft model like Qwen 0.6B). The drafted token is then verified on the next forward pass; if it matches what the model would have produced, you got the 2nd token MUCH faster, essentially for free.
If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually it's correct, so inference is a lot faster.
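The accept/reject loop above can be sketched as a toy simulation. Everything here is made up for illustration: the "model" is a trivial next-token rule, and `mtp_draft` is a stand-in for an MTP head that is deliberately wrong some of the time, so both the accept and the fallback paths get exercised.

```python
# Toy sketch of speculative decoding with a multi-token-prediction (MTP) head.
# Hypothetical stand-ins, not a real model: main_model() is the "expensive"
# forward pass, mtp_draft() is the cheap second-token guess from the MTP head.

def main_model(seq):
    """One 'forward pass': the exact next token. Toy rule: last token + 1."""
    return (seq[-1] + 1) % 100

def mtp_draft(seq):
    """Next token plus the MTP head's guess for the token after it.
    We make the guess deliberately wrong whenever it would be a multiple
    of 10, to exercise the reject path."""
    t1 = main_model(seq)
    guess = (t1 + 1) % 100
    if guess % 10 == 0:
        guess = 0  # wrong draft
    return t1, guess

def generate(seq, n_tokens):
    seq = list(seq)
    passes = 0
    while len(seq) < n_tokens:
        # One forward pass yields the verified next token t1 and a free
        # draft t2 from the MTP head.
        t1, t2_draft = mtp_draft(seq)
        passes += 1
        seq.append(t1)
        if len(seq) >= n_tokens:
            break
        # Verify the draft against what the model would have produced.
        # In a real system this check is folded into the *next* forward
        # pass over both tokens at once, so it costs no extra pass.
        t2_true = main_model(seq)
        if t2_draft == t2_true:
            seq.append(t2_draft)  # accepted: two tokens for one pass
        # else: rejected; the token is produced normally on the next pass
    return seq, passes

out, passes = generate([0], 21)
print(passes, "passes for", len(out) - 1, "new tokens")
```

When the draft is usually right, the pass count approaches half the number of generated tokens; each rejection costs one extra pass, which is exactly the trade-off described above.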