Hacker News

moffkalast · yesterday at 10:07 AM

Hmm, but isn't the checking only required because the draft model is not the same model and can only speculate about what the main one is thinking, hence the name? If the main model generates two tokens itself, how can it be wrong about its own predictions?


Replies

jychang · yesterday at 10:44 AM

Because if you generate token n+1 with all 48 layers and 80 billion params of Qwen3-Next, and also generate token n+2 with the single MTP layer at ~2 billion params... that n+2 token can be much lower quality than the n+1 token, but it will mostly be correct.

Let's say the model is completing the string "The 44th president of the United States is ___ ___". The main model generates "Barack" as the n+1 token, and the MTP layer probably does a good enough job to generate "Obama" as the n+2 token (even though that MTP layer is a mere <2 billion parameters in size). Then you just check that "Obama" is correct via the usual speculative decoding verification, which is a lot faster than starting over from layers 1-48 and generating "Obama" the regular way.
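Roughly, as a toy Python sketch (full_model and mtp_head are hypothetical stand-ins for illustration, not Qwen's actual API):

    import torch

    def generate_two_tokens(full_model, mtp_head, tokens):
        # One full forward pass (all 48 layers) yields hidden states
        # plus the logits that pick token n+1.
        hidden, logits = full_model(tokens)
        tok_n1 = logits[-1].argmax()                      # e.g. "Barack"

        # The cheap MTP layer drafts n+2 from the same hidden state.
        draft_n2 = mtp_head(hidden[-1], tok_n1).argmax()  # e.g. "Obama"

        # Verification: one full pass scores both new positions in
        # parallel, instead of a second sequential full-model step.
        seq = torch.cat([tokens, tok_n1[None], draft_n2[None]])
        _, verify_logits = full_model(seq)
        full_n2 = verify_logits[-2].argmax()              # main model's own n+2
        if full_n2 == draft_n2:
            return [tok_n1, draft_n2]  # draft accepted: 2 tokens for ~1 extra pass
        return [tok_n1, full_n2]       # draft rejected: fall back to the main model

Either way the output matches what the full model would have produced (under greedy decoding); the MTP draft only changes how much compute it took to get there.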

SonOfLilit · yesterday at 10:14 AM

If you ask me to guess an answer, I'll _usually_ produce the same answer as if I had time to think about it deeply, but not always...

EMM_386 · yesterday at 12:20 PM

I believe it's something along these lines. The MTP head runs simultaneously with the main model and generates a list of candidate next tokens with probabilities, based on what it learned during training.

If n+1 = "Barack" then n+2 = "Obama" (confidence: 0.90)
If n+1 = "The" then n+2 = "quick" (confidence: 0.45)
If n+1 = "President" then n+2 = "Biden" (confidence: 0.75)

A threshold is set (say, 90%), and if the n+2 prediction's confidence is at or above it (as in the first example), the token is used without the main model having to determine it. It's confident "enough".
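Something like this toy sketch, if I'm reading it right (the names and the 0.90 cutoff are illustrative, not Qwen's actual config):

    ACCEPT_THRESHOLD = 0.90  # illustrative cutoff, not a documented setting

    def accept_draft(candidates):
        # candidates: the MTP head's n+2 guesses with confidences,
        # e.g. {"Obama": 0.90} when n+1 = "Barack".
        tok_n2, conf = max(candidates.items(), key=lambda kv: kv[1])
        # At or above the threshold the draft is used as-is; below it,
        # the main model has to confirm (or replace) the token.
        return tok_n2, conf >= ACCEPT_THRESHOLD

    accept_draft({"Obama": 0.90})  # -> ("Obama", True): confident enough
    accept_draft({"quick": 0.45})  # -> ("quick", False): needs verification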

eldenring · yesterday at 1:29 PM

The 2nd token is generated without knowing which token was actually chosen for the 1st token, so it still has to be checked.