Hacker News

bigyabai, yesterday at 12:25 AM (1 reply)

> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible.

This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles' heel of the M-series chips is their weak, SoC-grade GPU, which keeps even the Max and Ultra models from reaching interactive TTFTs (time to first token) on larger models and contexts.


Replies

fancyfredbot, yesterday at 2:32 PM

People normally refer to the compute-bound phase as "prefill". There's nothing wrong with saying it's building the KV cache, though; it's accurate, just unusual.
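
For anyone who wants to see the prefill/decode split in numbers, here is a rough back-of-envelope sketch in Python. It assumes the common approximation of about 2 FLOPs per parameter per token for a forward pass (ignoring the attention term, which grows with context), and that each decoded token has to stream all the weights from memory once (ignoring KV-cache reads). The function names and every hardware figure below are hypothetical placeholders for illustration, not measured specs of any real chip.

```python
def prefill_seconds(n_params, prompt_tokens, flops_per_sec):
    """Estimate time to first token when prefill is compute-bound.

    Uses ~2 FLOPs per parameter per prompt token; attention cost is ignored,
    so this is a lower bound that gets worse at long contexts.
    """
    return 2 * n_params * prompt_tokens / flops_per_sec


def decode_seconds_per_token(n_params, bytes_per_param, bytes_per_sec):
    """Estimate per-token latency when decode is memory-bandwidth-bound.

    Assumes every weight is read from memory once per generated token.
    """
    return n_params * bytes_per_param / bytes_per_sec


if __name__ == "__main__":
    # Hypothetical setup: 70B-parameter model, 4-bit weights, 8k-token prompt.
    n_params = 70e9
    bytes_per_param = 0.5          # 4-bit quantization (assumed)
    prompt_tokens = 8000
    flops_per_sec = 20e12          # assumed sustained GPU throughput
    bytes_per_sec = 800e9          # assumed memory bandwidth

    ttft = prefill_seconds(n_params, prompt_tokens, flops_per_sec)
    per_token = decode_seconds_per_token(n_params, bytes_per_param, bytes_per_sec)

    print(f"prefill (TTFT): {ttft:.1f} s")
    print(f"decode: {per_token * 1000:.1f} ms/token ({1 / per_token:.1f} tok/s)")
```

With placeholder numbers like these, decode speed is set by memory bandwidth and looks fine, while the time to first token is set by compute and scales with prompt length, which is the distinction both comments above are drawing.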