zozbot234 • yesterday at 2:39 PM

CPU-MoE still helps with mmap. It shouldn't hurt token-gen speed much on the Mac, since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck.
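
To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes decoding is purely memory-bound, so each generated token has to stream the active weights from memory once. The function name and every number below are hypothetical placeholders for illustration, not measurements of any particular Mac or model.

```python
def max_tokens_per_second(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model.

    Assumes each token reads the active weights exactly once, so the
    ceiling is simply bandwidth divided by bytes read per token.
    """
    if bandwidth_gb_s <= 0 or active_weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / active_weights_gb


# Hypothetical figures: the full unified memory bandwidth versus the
# share the CPU can reach, with the same active weights per token.
full_bandwidth_bound = max_tokens_per_second(400.0, 10.0)
cpu_bandwidth_bound = max_tokens_per_second(300.0, 10.0)

print(f"full bandwidth ceiling: {full_bandwidth_bound:.1f} tok/s")
print(f"CPU-reachable ceiling:  {cpu_bandwidth_bound:.1f} tok/s")
```

Under these assumptions the ceiling scales linearly with reachable bandwidth, so if the CPU can reach most of it, the token-gen ceiling drops by only that same fraction.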

Replies

abhikul0 • yesterday at 3:09 PM

I'll try that, but llama-server has mmap on by default and the model still takes up its full size in RAM. Not sure what's going on.
