bigyabai • yesterday at 5:15 PM • 0 replies

For larger contexts, the bottleneck is probably prompt prefill, which is compute-bound, rather than memory bandwidth. Supposedly prefill is faster on the M5+ GPUs, but it's still a big hurdle for pre-M5 chips.
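
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a dense model at about 2 FLOPs per parameter per token, ignores attention's extra cost at long context, and treats decode as streaming every weight from memory once per generated token. The function names and the hardware figures are made up for illustration, not measurements of any real chip.

```python
def prefill_seconds(prompt_tokens, params, flops_per_sec):
    # Compute-bound estimate: ~2 FLOPs per parameter per prompt token.
    # Ignores the attention term, which grows with context length.
    return 2 * params * prompt_tokens / flops_per_sec


def decode_seconds_per_token(params, bytes_per_param, bytes_per_sec):
    # Bandwidth-bound estimate: all weights are read once per generated token.
    return params * bytes_per_param / bytes_per_sec


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the tradeoff.
    params = 8e9            # 8B-parameter model
    bytes_per_param = 0.5   # 4-bit quantized weights
    flops_per_sec = 20e12   # assumed usable GPU throughput
    bytes_per_sec = 400e9   # assumed memory bandwidth

    for prompt_tokens in (1_000, 32_000, 128_000):
        wait = prefill_seconds(prompt_tokens, params, flops_per_sec)
        print(f"{prompt_tokens:>7} prompt tokens: ~{wait:.1f} s before the first token")

    per_token = decode_seconds_per_token(params, bytes_per_param, bytes_per_sec)
    print(f"decode: ~{per_token * 1000:.0f} ms per generated token")
```

Under these assumptions the decode cost per token stays flat while the wait before the first token scales with prompt length, so more GPU compute helps long prompts far more than extra bandwidth does.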
