Hacker News

SparkyMcUnicorn · 05/04/2025

I ran R1-8B for both anemll[0] and mlx[1][2] models on an M4 Max.

Prompt: "Tell me a long story about the origins of 42 being the answer."

anemll: 9.3 tok/sec, ~500MB of memory used.

mlx 8bit: 31.33 tok/sec, ~8.5GB of memory used.

mlx bf16: 27.17 tok/sec, ~15.7GB of memory used.

Memory results are from Activity Monitor, summed across any potentially involved processes, but I feel like I might be missing something here...
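If anyone wants to reproduce the mlx side, a minimal sketch with the mlx-lm Python package looks roughly like this (the repo id is a placeholder for the 8-bit model in [1]; flags and versions may differ from what I actually used):

    # Minimal sketch, assuming the mlx-lm package; repo id is a placeholder for [1].
    from mlx_lm import load, generate

    MODEL_ID = "mlx-community/..."  # fill in the 8-bit DeepSeek-R1 distill repo from [1]
    model, tokenizer = load(MODEL_ID)

    prompt = "Tell me a long story about the origins of 42 being the answer."
    # verbose=True prints generation tokens-per-sec (and peak memory in recent versions).
    generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)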

[0] https://huggingface.co/anemll/anemll-DeepSeekR1-8B-ctx1024_0...

[1] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...

[2] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...


Replies

antirez · 05/04/2025

Thank you. Strange. If the memory numbers are accurate, it is probably so slow because layers are loaded from disk before each layer's inference, or something like that; otherwise it could not run inference on such a model in 500MB. But if that is what it does, 33% of the speed would likely already be too fast.

ericboehs · 05/05/2025

Interesting. Does this mean larger models could be run on less memory? It looks like it uses 15-20x less memory. Could a 671B DeepSeek R1 be run in just ~40-50GB of memory? It sounds like it'd be 1/3 as fast though (<1 tok/sec).
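Back-of-envelope version of that guess, just extrapolating the 8B ratio (pure speculation, not a measurement):

    # Speculative extrapolation of the 8B memory ratio to the full 671B model.
    mlx_8bit_gb = 8.5                 # observed for the 8B model (mlx 8-bit)
    anemll_gb = 0.5                   # observed for the 8B model (anemll)
    ratio = mlx_8bit_gb / anemll_gb   # ~17x

    r1_8bit_weights_gb = 671          # ~671B params at ~1 byte/param, weights only
    print(r1_8bit_weights_gb / ratio) # ~39 GB, *if* the same ratio held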

anemll · 05/04/2025

What hardware are you on? Most models are memory-bandwidth limited. The ANE was limited to 64GB/s prior to the M3 Max and M4 Pro. If you are on an M1, the GPU will be significantly faster for 3-8B models due to memory bandwidth rather than ANE capabilities.
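For intuition, a rough roofline sketch (placeholder numbers, weights only, ignoring caching and actual quantization): each generated token streams roughly the full weight set, so bandwidth divided by weight bytes bounds tok/sec.

    # Rough roofline: tok/sec upper bound ~= memory bandwidth / bytes streamed per token.
    # Placeholder numbers; real quantization and caching change the picture.
    ane_bandwidth_gb_s = 64.0   # pre-M3 Max / M4 Pro ANE limit mentioned above
    weights_gb = 8.0            # ~8B params at ~1 byte/param (8-bit), weights only
    print(ane_bandwidth_gb_s / weights_gb, "tok/sec upper bound")  # ~8 tok/sec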

srigi · 05/05/2025

Can you add a recent build of llama.cpp (arm64) to the results pool? I'm really interested in comparing mlx to llama.cpp, but setting up mlx seems too difficult for me to do by myself.
