logoalt Hacker News

tompyesterday at 2:25 PM1 replyview on HN

No, the parent is wrong.

Checking a token is the same as generating it.

The benefit however is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and 4). You also get the “real” prediction for token 2. If the “real” prediction matches the MTP (Multi-Token Prediction) from previous turn, you have just generated 3 correct tokens (and another speculative). If not, you’ve now corrected token 2, but token 3 is wrong (it follows the wrong token 2) so you need ti generate it again.


Replies

bigwheelsyesterday at 7:19 PM

Thanks for the clarification. Your comment made me connect the similarity (in spirit) of Speculative Decoding to Speculative Execution [1] in CPUs. Very cool and clever optimization strategy for LLMs, IMHO.

[1] https://en.wikipedia.org/wiki/Speculative_execution

Does it work to predict tokens 3 and 4 (or 5, 6) in the same way? I wonder how extreme the hit rate drop-off is.