bigyabai • yesterday at 12:25 AM

> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible.

This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles' heel of the M-series chips is their weak, SoC-grade GPU, which holds back the Max and Ultra models from having an interactive time to first token (TTFT) on larger models and contexts.

Replies

fancyfredbot • yesterday at 2:32 PM

People normally refer to the compute-bound phase as "prefill". There's nothing wrong with saying it's building the KV cache, though; it's accurate, just unusual.
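
For anyone who wants to put rough numbers on the two phases, here is a back-of-envelope sketch. It assumes the usual rules of thumb: prefill costs about 2 FLOPs per parameter per prompt token (compute-bound), and decode streams every weight from memory once per generated token (bandwidth-bound). The function names and all hardware figures are hypothetical placeholders, not measured specs of any chip, and the model ignores attention cost, KV cache reads, and real-world utilization.

```python
def prefill_seconds(n_params, prompt_tokens, flops_per_sec):
    """Compute-bound estimate of time to first token (the prefill phase)."""
    # Rule of thumb: ~2 FLOPs per parameter per prompt token.
    # Attention's quadratic term and sub-100% utilization are ignored.
    return 2 * n_params * prompt_tokens / flops_per_sec


def decode_tokens_per_sec(model_bytes, bandwidth_bytes_per_sec):
    """Bandwidth-bound estimate of generation speed (the decode phase)."""
    # Rule of thumb: each new token reads all weights from memory once.
    # KV cache reads are ignored.
    return bandwidth_bytes_per_sec / model_bytes


if __name__ == "__main__":
    # Hypothetical placeholder values, chosen only to illustrate the arithmetic.
    n_params = 10e9        # parameters
    model_bytes = 5e9      # weights in memory (e.g. ~4-bit quantization)
    prompt_tokens = 1000
    flops_per_sec = 20e12  # sustained compute throughput
    bandwidth = 400e9      # bytes per second

    ttft = prefill_seconds(n_params, prompt_tokens, flops_per_sec)
    speed = decode_tokens_per_sec(model_bytes, bandwidth)
    print(f"estimated TTFT: {ttft:.2f} s")
    print(f"estimated decode speed: {speed:.1f} tokens/s")
```

The point of the sketch is that the two estimates depend on different hardware numbers: more bandwidth raises decode speed but does nothing for TTFT, which scales with prompt length and compute throughput.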
