Hacker News

nodja, last Tuesday at 8:57 PM

Same reason why prompt processing is faster than text generation.

When you already know the tokens ahead of time, you can compute the probabilities for all of them in one batched forward pass. The weights get read from memory once per batch instead of once per token, which is a significant bandwidth saving. This doesn't help if you're already compute bound, so people on Macs etc. won't benefit as much from this.
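A minimal sketch of the bandwidth argument, using a single weight matrix as a stand-in for the model (the matrix size `d` and token count `T` are illustrative assumptions):

```python
import numpy as np

d = 1024  # hidden size (illustrative)
T = 16    # tokens known ahead of time (prompt, or verified draft)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))  # stand-in for the model weights
x = rng.standard_normal((T, d))  # one activation vector per token

# Token-by-token (generation-like): W must be read from memory T times.
seq = np.stack([x[t] @ W for t in range(T)])

# Batched (prompt-processing-like): one matmul, W read once,
# so the memory traffic for weights is amortized over T tokens.
bat = x @ W

assert np.allclose(seq, bat)  # same math, different memory traffic

print("weight bytes, token-by-token:", T * W.nbytes)
print("weight bytes, batched       :", W.nbytes)
```

If the chip has spare FLOPs (typical for big-GPU inference), the batched version runs almost as fast as a single token's forward pass; if it's already compute bound, the amortized weight reads buy you little.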


Replies

Majromax, last Wednesday at 1:30 AM

Are Macs/etc compute bound with their 'it fits in unified memory' language models? Certainly by the time you're streaming weights from SSD you must be back in a bandwidth-bound regime.
