Hacker News

stingraycharles · yesterday at 12:53 PM

Because then the second token only needs to be checked, not generated, since it has already been produced? And it's much faster to generate multiple tokens at the same time than one at a time? Is that the idea?

I’m not an expert on LLMs, just a user.


Replies

tomp · yesterday at 2:25 PM

No, the parent is wrong.

Checking a token requires the same computation as generating it.

The benefit, however, is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and 4). That same pass also gives you the "real" prediction for token 2. If the "real" prediction matches the MTP (Multi-Token Prediction) guess from the previous turn, you have just generated 3 correct tokens (plus another speculative one). If not, you've now corrected token 2, but token 3 is wrong (it follows the wrong token 2), so you need to generate it again.
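A minimal sketch of that bookkeeping, assuming a hypothetical forward(seq) that returns, from one pass, the model's real prediction at the speculated position, the real prediction at the position after it, and a fresh MTP guess (none of this is a real API):

    def speculative_step(forward, seq, mtp_guess):
        """One decoding turn with a single MTP guess (toy sketch)."""
        trial = seq + [mtp_guess]
        # One pass over `trial` both checks the guess and extends the sequence.
        real_check, next_tok, next_guess = forward(trial)
        if real_check == mtp_guess:
            # Guess confirmed: two tokens accepted for one pass, plus a
            # fresh speculative guess for the next turn.
            return seq + [mtp_guess, next_tok], next_guess
        # Guess rejected: keep the corrected token, drop everything after it,
        # and re-speculate next turn.
        return seq + [real_check], None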

bdcs · yesterday at 2:31 PM

It relies on an "unintuitive observation"[0]: you can run batches basically for free (up to a limit). So instead of running a single inference, you batch it together with a lot of guesses; if you guess right, you speed up decoding by the number of correct guesses, and if you guess wrong, you're back to regular speed (and the output is still fully correct).

[0] https://x.com/karpathy/status/1697318534555336961
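A rough way to see the effect on CPU with numpy (sizes are arbitrary toy assumptions; on GPUs, where decoding is bound by streaming the weights, the effect is much stronger): multiplying one weight matrix against 1 position costs about the same as against 8, because both are dominated by reading the weights.

    import time
    import numpy as np

    d = 4096
    W = np.random.randn(d, d).astype(np.float32)  # stand-in for one weight matrix

    for k in (1, 8):  # 1 real token vs. 1 real token + 7 guesses
        x = np.random.randn(k, d).astype(np.float32)
        t0 = time.perf_counter()
        for _ in range(50):
            _ = x @ W
        print(f"{k} position(s): {time.perf_counter() - t0:.3f}s")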

namibj · yesterday at 1:55 PM

Basically you can generate the next two tokens at once in the same matmul, and roll back to one-at-a-time generation when the check says you guessed wrong (since the second token of the pair was generated from context that has now been revoked).
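In toy numpy terms (the sizes and the bare output projection are illustrative assumptions, not any particular model):

    import numpy as np

    d, vocab = 256, 32_000
    W_out = np.random.randn(d, vocab).astype(np.float32)  # toy output projection

    h_last = np.random.randn(d).astype(np.float32)  # hidden state at last real token
    h_spec = np.random.randn(d).astype(np.float32)  # hidden state at the guessed token

    H = np.stack([h_last, h_spec])  # (2, d): both positions in one matrix
    logits = H @ W_out              # a single matmul yields logits for both
    tok_next, tok_after = logits.argmax(axis=-1)
    # If tok_next disagrees with the guess, tok_after was computed from revoked
    # context and must be discarded (the rollback to one-at-a-time).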

Zacharias030 · today at 6:49 AM

Yes: if you know the sequence of tokens ahead of time, you can verify all of them about as quickly as you can generate one more token, because of the parallelism benefits.

If you don't know the future tokens, though, then you can't, and blindly guessing them is infeasible because the vocabulary contains circa 100k possible different tokens.
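A toy sketch of that verification, assuming a hypothetical logits_fn that runs one causal forward pass and returns per-position next-token logits as an array:

    def count_verified(logits_fn, prompt, candidate):
        """Return how many leading tokens of `candidate` the model confirms."""
        logits = logits_fn(prompt + candidate)  # (seq_len, vocab), one pass
        preds = logits.argmax(axis=-1)          # greedy next token per position
        n_ok = 0
        for i, tok in enumerate(candidate):
            # preds[p] is the model's choice for position p + 1, so the
            # token at position len(prompt) + i is checked against:
            if preds[len(prompt) + i - 1] != tok:
                break
            n_ok += 1
        return n_ok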