> What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?
It is only useful for inference and doesn't help with pretraining, which actually points to speculative decoding not being sufficiently general: the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...
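For what it's worth, here is a rough sketch (plain PyTorch, toy model, all names made up) of how extra multi-token-prediction heads can be used speculatively at inference: the heads draft a few future tokens from one forward pass, and the main head then keeps the longest prefix it agrees with. A real implementation verifies all drafted positions in a single batched forward pass, which is where the speedup comes from; the loop below only illustrates the accept/reject logic.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for an LM with k extra multi-token-prediction heads."""
    def __init__(self, vocab=100, dim=32, k=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.main_head = nn.Linear(dim, vocab)                 # predicts token t+1
        self.mtp_heads = nn.ModuleList(
            [nn.Linear(dim, vocab) for _ in range(k)])         # predict t+2 .. t+1+k

    def forward(self, ids):
        h = self.embed(ids).mean(dim=1)                        # crude "context" vector
        return self.main_head(h), [head(h) for head in self.mtp_heads]

def self_speculative_step(model, ids):
    """Draft k+1 tokens from one forward pass, then greedily keep the longest
    prefix the main head still agrees with after re-reading the extended context."""
    main_logits, mtp_logits = model(ids)
    draft = [main_logits.argmax(-1)] + [l.argmax(-1) for l in mtp_logits]
    accepted = [draft[0]]                                      # main-head token is always kept
    for tok in draft[1:]:
        ids = torch.cat([ids, accepted[-1].view(1, 1)], dim=1)
        verify_logits, _ = model(ids)
        if verify_logits.argmax(-1).item() != tok.item():
            break                                              # disagreement: discard the rest
        accepted.append(tok)
    return accepted

model = TinyLM()
prompt = torch.randint(0, 100, (1, 5))
print([t.item() for t in self_speculative_step(model, prompt)])
```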
There is no reason that it couldn’t be beneficial for training though.
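Right, and that is essentially what multi-token prediction as a training objective looks like: an extra head predicts the token two steps ahead, and its cross-entropy is added to the ordinary next-token loss as an auxiliary term. A minimal sketch, assuming a single extra head and an illustrative 0.3 weight rather than anyone's published recipe:

```python
import torch
import torch.nn.functional as F

def mtp_loss(main_logits, mtp_logits, targets, mtp_weight=0.3):
    """Next-token loss plus an auxiliary loss for predicting two steps ahead.

    main_logits: (B, T, V) main head, position t predicts token t+1
    mtp_logits:  (B, T, V) extra head, position t predicts token t+2
    targets:     (B, T+1) labels shifted one step past the inputs, with one
                 extra token so the t+2 head has a label at every position
    """
    next_tok  = targets[:, :-1]                      # labels for the t+1 head
    next_next = targets[:, 1:]                       # labels for the t+2 head
    loss_main = F.cross_entropy(main_logits.reshape(-1, main_logits.size(-1)),
                                next_tok.reshape(-1))
    loss_mtp  = F.cross_entropy(mtp_logits.reshape(-1, mtp_logits.size(-1)),
                                next_next.reshape(-1))
    return loss_main + mtp_weight * loss_mtp

B, T, V = 2, 8, 100
loss = mtp_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                torch.randint(0, V, (B, T + 1)))
print(loss.item())
```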