kbumsik • today at 2:30 PM • 1 reply

> performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.

I heard that speculative decoding doesn't affect performance (I meant accuracy). Am I wrong about it?

Replies

ketchup32613 • today at 2:41 PM

You're not wrong about that. Speculative decoding does not affect the quality of tokens generated, as each token has to be verified by the parent model before it is output.

What can change is the acceptance rate: the fraction of draft-model tokens that the parent/original model accepts during verification. If that rate falls, the speedup from speculative decoding shrinks and can be eliminated entirely. This acceptance rate, and more directly the speedup from draft models, is what "performance" refers to in the article.
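To make the distinction concrete, here is a minimal toy sketch of the greedy case. Everything in it is an assumption for illustration: `target_model`, `make_draft`, and `speculative_decode` are hypothetical names, the "models" are plain functions over integer tokens, and real systems verify all drafted positions in one batched forward pass (and use a rejection-sampling rule when sampling rather than decoding greedily). The point is only that the output matches target-only decoding, while the number of verification passes depends on how often the draft is accepted.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # greedy next-token function (toy stand-in)


def target_model(ctx: List[int]) -> int:
    # Hypothetical "parent" model: a deterministic function of the context.
    return (sum(ctx) * 31 + len(ctx)) % 10


def make_draft(disagree_every: int) -> Model:
    # Hypothetical draft model: copies the target, except that it is
    # deliberately wrong whenever len(ctx) is a multiple of disagree_every.
    def draft(ctx: List[int]) -> int:
        tok = target_model(ctx)
        return (tok + 1) % 10 if len(ctx) % disagree_every == 0 else tok
    return draft


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n: int, k: int = 4) -> Tuple[List[int], int, float]:
    out = list(prompt)
    passes = proposed = accepted = 0
    while len(out) - len(prompt) < n:
        # 1) The draft model proposes k tokens autoregressively.
        ctx, guesses = list(out), []
        for _ in range(k):
            g = draft(ctx)
            guesses.append(g)
            ctx.append(g)
        # 2) One verification pass by the target. It is written as a loop
        #    here, but it is counted as a single pass because a real system
        #    scores all k positions in one forward pass.
        passes += 1
        proposed += k
        for g in guesses:
            t = target(out)
            if g == t:
                out.append(g)       # accepted: same token the target would emit
                accepted += 1
            else:
                out.append(t)       # rejected: emit the target's token instead
                break
        else:
            out.append(target(out))  # all k accepted: one extra target token
    return out[len(prompt):][:n], passes, accepted / proposed


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 20)
good = speculative_decode(target_model, make_draft(1000), prompt, 20)
bad = speculative_decode(target_model, make_draft(1), prompt, 20)
print(good[0] == baseline, bad[0] == baseline)  # same tokens either way
print("passes:", good[1], "vs", bad[1])
```

In this toy, a draft that is always accepted yields k + 1 tokens per verification pass, while a draft that is always rejected yields one token per pass, which is no better than running the target alone (and worse once the cost of drafting is counted). The generated tokens are identical in both cases.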
