Hacker News

SparkyMcUnicorn today at 12:28 AM

I ran R1-8B for both anemll[0] and mlx[1][2] models on an M4 Max.

Prompt: "Tell me a long story about the origins of 42 being the answer."

anemll: 9.3 tok/sec, ~500MB of memory used.

mlx 8bit: 31.33 tok/sec, ~8.5GB of memory used.

mlx bf16: 27.17 tok/sec, ~15.7GB of memory used.

Memory results are from Activity Monitor across any potentially involved processes, but I feel like I might be missing something here...

[0] https://huggingface.co/anemll/anemll-DeepSeekR1-8B-ctx1024_0...

[1] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...

[2] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...
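
For anyone who wants to reproduce the mlx side of this, here is a minimal sketch using the mlx_lm Python API. The repo id is an assumption (the links in [1] and [2] are truncated above), and max_tokens is an arbitrary choice; verbose=True makes mlx_lm print tok/sec and peak memory itself:

    # pip install mlx-lm
    from mlx_lm import load, generate

    # Assumed repo id -- [1] is truncated; substitute the exact 8-bit repo.
    model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Llama-8B-8bit")

    prompt = "Tell me a long story about the origins of 42 being the answer."

    # verbose=True prints prompt/generation tok/sec and peak memory.
    text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)

mlx_lm's own peak-memory report is a useful cross-check against the Activity Monitor numbers above, since it excludes unrelated processes.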


Replies

antirez today at 8:15 AM

Thank you. Strange. If the memory numbers are accurate, it is probably so slow because layers are loaded from disk before each layer's inference, or something like that; otherwise it could not run inference on such a model in 500MB. But if that is what it is doing, even 33% of the mlx speed would likely already be too fast.

anemll today at 4:23 AM

What hardware are you on? Most models are memory-bandwidth limited. The ANE was limited to 64GB/s prior to the M3 Max and M4 Pro. If you are on an M1, the GPU will be significantly faster for 3-8B models due to memory bandwidth rather than ANE capabilities.
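
A back-of-envelope sketch of that bandwidth argument (the ~4-bit weight width and the all-weights-read-once-per-token model are assumptions, not stated in the thread):

    # Rough decode ceiling when weight reads dominate.
    params = 8e9                     # 8B-parameter model
    bytes_per_param = 0.5            # assumption: ~4-bit quantized weights
    weight_bytes = params * bytes_per_param      # ~4 GB read per token
    bandwidth = 64e9                 # the 64GB/s ANE figure quoted above
    print(bandwidth / weight_bytes)  # ~16 tok/s upper bound at that bandwidth

Under those assumptions a 64GB/s ANE tops out around 16 tok/s for an 8B model, the same order of magnitude as the 9.3 tok/s reported above, while the GPU's much higher memory bandwidth is what the mlx numbers reflect.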
