
astrostl, yesterday at 10:59 PM

> MacBook Pro M5 128GB RAM

614 GB/s of memory bandwidth

> MacMini M4 with 64GB of RAM

273 GB/s of memory bandwidth (also only currently available with 48GB)

When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.
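As a rough sketch of that reasoning: if decoding is purely bandwidth-bound, each generated token has to read the model weights once, so the ceiling on tokens per second is bandwidth divided by weight size. The model size below (a 27B dense model at about 4 bits per weight) is an assumption for illustration, and the function name is made up; real throughput will land below these ceilings.

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when every token reads all weights once."""
    return bandwidth_gb_s / weights_gb

# Assumption: 27B parameters at ~0.5 bytes each (4-bit quantization) = 13.5 GB.
WEIGHTS_GB = 27e9 * 0.5 / 1e9

fast = decode_tokens_per_sec(614, WEIGHTS_GB)  # 614 GB/s figure quoted above
slow = decode_tokens_per_sec(273, WEIGHTS_GB)  # 273 GB/s figure quoted above
speedup = fast / slow

print(f"{fast:.1f} vs {slow:.1f} tok/s ceiling, ratio {speedup:.2f}x")
```

The ratio depends only on the two bandwidth figures, not on the model size, which is why adding more memory to the slower machine would not close the gap.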

And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but the noise stays within tolerance for me. I haven't heat-tested it on longer runs, but I imagine the extra airflow from being raised helps a ton.


Replies

iagooar, today at 7:18 AM

On paper the M4 should be roughly 1/3 the speed of the M5, but in practice it is closer to 1/2. With the right optimized model, like qwen3.6 35B MoE MLX, you can get over 40 tok/sec on it. I run dozens of background jobs that are not time-critical on it.

bigyabai, today at 12:25 AM

> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible.

This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips is their weak, SoC-grade GPU, which holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.
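To illustrate why prompt processing is a different bottleneck: a common back-of-envelope rule is that prefill costs about 2 FLOPs per active parameter per prompt token, so time to first token scales with GPU compute rather than memory bandwidth. This is only a sketch; the function name, the throughput figures, and the utilization factor are hypothetical placeholders rather than measured specs for any chip, and the rule ignores the attention cost that grows with context length.

```python
def prefill_seconds(active_params: float, prompt_tokens: int,
                    peak_flops: float, utilization: float = 0.5) -> float:
    """Rough TTFT estimate: ~2 FLOPs per active parameter per prompt token."""
    flops_needed = 2.0 * active_params * prompt_tokens
    return flops_needed / (peak_flops * utilization)

# Hypothetical numbers: a 27B dense model, a 32k-token prompt, and two
# made-up GPUs that differ 10x in peak compute.
ttft_small_gpu = prefill_seconds(27e9, 32_000, peak_flops=30e12)
ttft_big_gpu = prefill_seconds(27e9, 32_000, peak_flops=300e12)

print(f"{ttft_small_gpu:.1f} s vs {ttft_big_gpu:.1f} s to first token")
```

Under these assumptions, memory bandwidth never enters the estimate, so a machine can decode quickly yet still take a long time to start answering a long prompt.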
