The README lacks the most important thing: how many more tokens/sec do you get at the same quantization compared to llama.cpp / MLX? It's only worth switching default platforms if there's a major improvement.
Yeah, I looked all over for a comparison and couldn't find anything in the repo, on their social media, etc. I saw some other comments here that said it's supposed to be "15.8 fp16 ops compared to 14.7 fp32 ops" but that isn't really enough to go on. Maybe when I have the time I'll install their TestFlight app and do some comparisons myself.
In my testing, tokens per second is about half the speed of the GPU, but power usage is roughly 10x lower: 2 watts on the ANE vs. 20 watts on the GPU of my M4 Pro.
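If you want to reproduce the power numbers, here is a minimal sketch that averages powermetrics samples while a generation runs in another terminal. It assumes macOS on Apple Silicon and needs sudo; the "ANE Power" / "GPU Power" field names come from the cpu_power sampler output and may differ across macOS versions.

```python
import re
import subprocess

def sample_power(seconds: int = 10) -> dict:
    """Sample powermetrics once per second and return average mW per unit."""
    out = subprocess.run(
        ["sudo", "powermetrics", "--samplers", "cpu_power",
         "-i", "1000", "-n", str(seconds)],
        capture_output=True, text=True, check=True,
    ).stdout
    # Field names ("CPU Power", "GPU Power", "ANE Power") are what I see in
    # the cpu_power sampler output; they may vary by macOS version.
    totals: dict[str, list[float]] = {}
    for unit, mw in re.findall(r"(CPU|GPU|ANE) Power: (\d+) mW", out):
        totals.setdefault(unit, []).append(float(mw))
    return {unit: sum(vals) / len(vals) for unit, vals in totals.items()}

if __name__ == "__main__":
    # Kick off the anemll or mlx generation first, then run this alongside it.
    print(sample_power(10))
```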
I ran R1-8B with both the anemll[0] and mlx[1][2] models on an M4 Max.
Prompt: "Tell me a long story about the origins of 42 being the answer."
anemll: 9.3 tok/sec, ~500MB of memory used.
mlx 8bit: 31.33 tok/sec, ~8.5GB of memory used.
mlx bf16: 27.17 tok/sec, ~15.7GB of memory used.
Memory results are from Activity Monitor, summed across any potentially involved processes, but I feel like I might be missing something here...
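For anyone who wants to reproduce the mlx side, here is a minimal sketch using the mlx-lm package (pip install mlx-lm). The exact repo id is a guess, since the links below are truncated; verbose=True prints prompt and generation tok/sec, and on older mlx versions the peak-memory call lives at mx.metal.get_peak_memory() instead.

```python
import mlx.core as mx
from mlx_lm import load, generate

# Repo id is an assumption based on the truncated link [1] below.
model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Llama-8B-8bit")

generate(
    model,
    tokenizer,
    prompt="Tell me a long story about the origins of 42 being the answer.",
    max_tokens=512,
    verbose=True,  # prints prompt/generation tok/sec stats
)

# Peak Metal allocation, as opposed to Activity Monitor's per-process view.
print(f"peak metal memory: {mx.get_peak_memory() / 2**30:.2f} GB")
```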
[0] https://huggingface.co/anemll/anemll-DeepSeekR1-8B-ctx1024_0...
[1] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...
[2] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...