Hacker News

joha4270 · yesterday at 12:19 PM

The guts of an LLM aren't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

suggests there is something I'm unaware of. If you compare the small and big models' outputs, don't you have to wait for the big model anyway, and then what's the point? I assume I'm missing some detail here, but what?


Replies

connorbrinton · yesterday at 12:48 PM

Speculative decoding takes advantage of the fact that it's faster to validate that a big model would have produced a particular sequence of tokens than to generate that sequence from scratch, because validation can be parallelized across positions. So the process is: generate with the small model -> validate with the big model -> generate with the big model only where validation fails (rough sketch below the links).

More info:

* https://research.google/blog/looking-back-at-speculative-dec...

* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...
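To make that loop concrete, here is a minimal sketch of the greedy-acceptance variant. The "models" are toy stand-in functions (small_model_next / big_model_next are invented for illustration, not a real LLM API), and the verification loop is written sequentially for clarity; a real implementation would score all drafted positions in a single batched forward pass of the big model.

```python
# A minimal, self-contained sketch of greedy speculative decoding.
# The "models" here are toy stand-ins, not real LLMs: each maps a token
# sequence to its predicted next token.

K = 4  # number of tokens the small model drafts per round

def small_model_next(tokens):
    # Cheap draft model: usually right, but wrong whenever the next token
    # would be a multiple of 5.
    nxt = tokens[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def big_model_next(tokens):
    # Expensive target model: always "correct" (counts up by one).
    return tokens[-1] + 1

def speculative_decode(tokens, num_new_tokens):
    while num_new_tokens > 0:
        # 1) Draft K tokens autoregressively with the small model (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(K):
            t = small_model_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2) Verify the draft with the big model. In a real system this is
        #    ONE batched forward pass over all K positions, not K
        #    sequential passes -- that's where the speedup comes from.
        accepted = []
        for t in draft:
            target = big_model_next(tokens + accepted)
            if t == target:
                accepted.append(t)       # draft token matches: keep it
            else:
                accepted.append(target)  # first mismatch: take the big
                break                    # model's token and stop

        tokens = tokens + accepted[:num_new_tokens]
        num_new_tokens -= len(accepted)
    return tokens

print(speculative_decode([1, 2, 3], 10))
```

When the draft is mostly right, the big model ends up doing roughly one verification pass per K tokens instead of one pass per token, which is the whole point.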

speedping · yesterday at 12:46 PM

Verification is faster than generation: one forward pass can verify multiple tokens at once, whereas generation needs a separate pass for every new token.
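As a toy illustration of that asymmetry (forward here is a made-up stand-in, not a real model API): a single pass over the prompt plus the drafted tokens yields a prediction at every position, which is enough to check the whole draft, whereas generating those same tokens would take one sequential pass each.

```python
# Toy illustration of why verification parallelizes: one forward pass
# returns a prediction for every position, so K draft tokens can be
# checked at once.

def forward(tokens):
    # Pretend causal model: the prediction at position i is the next token
    # after the prefix tokens[:i+1].
    return [t + 1 for t in tokens]

prompt = [1, 2, 3]
draft = [4, 5, 7, 8]  # tokens proposed by the small model

# ONE pass over prompt + draft scores all draft positions at once.
preds = forward(prompt + draft)
accepted = 0
for i, t in enumerate(draft):
    # Draft token i must match the prediction made from the prefix
    # ending just before it.
    if t == preds[len(prompt) + i - 1]:
        accepted += 1
    else:
        break
print(f"accepted {accepted} of {len(draft)} draft tokens in one pass")
```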

vanviegen · yesterday at 12:43 PM

I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...

ml_basics · yesterday at 12:56 PM

They are referring to a technique called "speculative decoding", I think.

cma · yesterday at 12:42 PM

When you predict with the small model, the big model can verify those tokens more as a batch, which makes it closer in speed to processing input tokens, as long as the predictions are good and the work doesn't have to be redone.