conradev · last Saturday at 6:58 PM

My understanding is that model throughput is ultimately limited by the fact that the ANE is less wide than the GPU.

At that point, the ANE loses because you have to split the model into chunks and only one fits at a time.


Replies

smpanaro · last Saturday at 7:14 PM

What do you mean by less wide? The main bottleneck for transformers is memory bandwidth. ANE has a much lower ceiling than CPU/GPU (yes, despite unified memory).
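To make the bandwidth claim concrete, here is a back-of-the-envelope sketch: for a memory-bandwidth-bound transformer, every decoded token must stream all the weights through memory once, so tokens/sec is capped at bandwidth divided by bytes per token. The bandwidth figures and model size below are illustrative assumptions, not measured ANE or GPU numbers.

```python
# Rough upper bound on decode tokens/sec for a bandwidth-bound
# transformer: each token requires streaming all weights once.

def max_tokens_per_sec(n_params: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Ceiling = available bandwidth / bytes moved per token."""
    bytes_per_token = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical 7B-parameter model in fp16 (2 bytes/param):
seven_b = 7e9
print(max_tokens_per_sec(seven_b, 2, 100))  # ~7 tok/s at 100 GB/s
print(max_tokens_per_sec(seven_b, 2, 400))  # ~29 tok/s at 400 GB/s
```

This is why a lower effective bandwidth ceiling translates directly into a lower generation-speed ceiling, regardless of how much compute is available.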

Chunking is actually beneficial as long as all the chunks fit into the ANE’s cache. It speeds up compilation for large network graphs, and cached loads have negligible cost. On M1 the cache limit is 3-4GB, but it is higher on M2+.
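The fit condition above is just arithmetic: the chunks only stay cheap to load if they all remain resident together. A minimal sketch, assuming the ~3.5 GB M1 cache figure from the comment (the constant and function are hypothetical, not a real API):

```python
# Quick check: will a chunked model stay resident in the ANE cache?
ANE_CACHE_GB = 3.5  # assumed mid-range M1 figure; higher on M2+

def chunks_fit(chunk_sizes_gb: list[float]) -> bool:
    """All chunks must fit simultaneously for cached loads to stay cheap."""
    return sum(chunk_sizes_gb) <= ANE_CACHE_GB

print(chunks_fit([0.9, 0.9, 0.9]))  # True: 2.7 GB total fits
print(chunks_fit([1.5, 1.5, 1.5]))  # False: 4.5 GB exceeds the cache
```

Once the total exceeds the cache, chunks must be re-fetched each pass, and the chunking advantage disappears.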
