Hacker News

awestroke (last Tuesday at 10:53 AM)

I don't understand how you can compare against the base model output without generating with the base model, in which case what's the point?


Replies

radarsat1 (last Tuesday at 1:30 PM)

Because of the nature of transformers: running a bunch of pregenerated tokens through them is a parallel operation, not an autoregressive one. That's how it works at training time, and speculative decoding uses the same trick at inference time. So if you just want to check whether a set of known tokens is "likely" under the base model, you can run them all through in a single forward pass and read off the probability distribution at every position; there's no need to sample.
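
A minimal sketch of what that looks like, assuming a Hugging Face causal LM in PyTorch ("gpt2" is just a stand-in for whatever base model you're checking against):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # The tokens you want to score are already known (e.g. produced elsewhere).
    ids = tokenizer("The quick brown fox jumps over the lazy dog",
                    return_tensors="pt").input_ids  # (1, seq_len)

    with torch.no_grad():
        # One parallel forward pass over every position; no sampling loop.
        logits = model(ids).logits  # (1, seq_len, vocab_size)

    # Logits at position t predict the token at position t+1, so shift by one
    # to get each known token's log-probability under the base model.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_log_probs = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    print(token_log_probs)  # per-token log-likelihood, obtained without generating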

It's the same reason there's a difference in speed between "prompt processing" and "generation". The former just takes the pre-existing prompt and builds the KV cache, which is parallel rather than autoregressive, and therefore much faster.
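
Roughly the same distinction in code (Hugging Face API again, model name is just a placeholder): the prompt is consumed in one parallel pass that also fills the KV cache, while generation runs one token per forward pass reusing that cache:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids

    with torch.no_grad():
        # "Prompt processing": one parallel pass over the whole prompt,
        # which also builds the KV cache for every prompt position.
        out = model(prompt_ids, use_cache=True)
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

        # "Generation": strictly one token per forward pass, reusing the cache.
        for _ in range(10):
            out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)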

qeternity (last Tuesday at 11:44 AM)

I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.

So let's say the draft model generates 5 tokens; all 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever), but as long as the 5 forward passes of the draft model plus 1 prefill pass of the target model are faster than 4 forward passes of the target, you get a speedup while producing exactly the same output distribution as the target.
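
A toy version of that verify step (Hugging Face models as placeholders, and greedy acceptance for simplicity; the real algorithm uses a rejection-sampling rule on the two probability distributions so the output matches the target's distribution exactly even when sampling):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()   # small, fast
    target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()  # big, slow

    K = 5  # number of tokens the draft proposes per round
    ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

    with torch.no_grad():
        # Draft proposes K tokens autoregressively (K cheap forward passes).
        proposal = draft.generate(ids, max_new_tokens=K, do_sample=False)
        drafted = proposal[:, ids.shape[1]:]

        # Target scores the prompt plus all K drafted tokens in ONE forward pass.
        logits = target(proposal).logits
        # The prediction for each drafted position comes from the position before it.
        target_pred = logits[:, ids.shape[1] - 1:-1].argmax(dim=-1)

        # Accept the longest prefix where the draft agrees with the target.
        matches = (target_pred == drafted).long()[0]
        n_accept = int(matches.cumprod(dim=0).sum())
        accepted = drafted[:, :n_accept]

    print(f"accepted {n_accept}/{K}:", tokenizer.decode(accepted[0]))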

nodja (last Tuesday at 8:57 PM)

Same reason why prompt processing is faster than text generation.

When you already know the tokens ahead of time, you can compute the probabilities for all of them in one batched pass, which saves a lot of memory bandwidth. This won't help if you're already compute bound, so people on Macs etc. won't get as much benefit from this.

Balinares (last Tuesday at 12:41 PM)

Isn't that exactly how draft models speed up inference, though? Validating a batch of tokens is significantly faster than generating them.

anentropic (last Tuesday at 11:48 AM)

Presumably that happens at training time?

Then once successfully trained, you get faster inference from just the diffusion model.

a1j9o94 (last Tuesday at 11:33 AM)

You would only use the base model during training. This is a distillation technique.