The README lacks the most important thing: how many more tokens/sec do you get at the same quantization compared to llama.cpp / MLX? It's only worth switching default platforms if there's a major improvement.
Yeah, I looked all over for a comparison and couldn't find anything in the repo, on their social media, etc. I saw some other comments here that said it's supposed to be "15.8 fp16 ops compared to 14.7 fp32 ops" but that isn't really enough to go on. Maybe when I have the time I'll install their TestFlight app and do some comparisons myself.
In my testing, tokens per second is about half the speed of the GPU, but power usage is roughly 10x lower: 2 watts on the ANE vs. 20 watts on the GPU of my M4 Pro.
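If you want to reproduce the power numbers, here is a minimal sketch that averages powermetrics samples while a generation runs in another terminal. It assumes macOS on Apple Silicon and needs sudo; the "ANE Power" / "GPU Power" field names come from the cpu_power sampler output and may differ across macOS versions.

```python
import re
import subprocess

def sample_power(seconds: int = 10) -> dict:
    """Sample powermetrics once per second and return average mW per unit."""
    out = subprocess.run(
        ["sudo", "powermetrics", "--samplers", "cpu_power",
         "-i", "1000", "-n", str(seconds)],
        capture_output=True, text=True, check=True,
    ).stdout
    # Field names ("CPU Power", "GPU Power", "ANE Power") are what I see in
    # the cpu_power sampler output; they may vary by macOS version.
    totals: dict[str, list[float]] = {}
    for unit, mw in re.findall(r"(CPU|GPU|ANE) Power: (\d+) mW", out):
        totals.setdefault(unit, []).append(float(mw))
    return {unit: sum(vals) / len(vals) for unit, vals in totals.items()}

if __name__ == "__main__":
    # Kick off the anemll or mlx generation first, then run this alongside it.
    print(sample_power(10))
```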
I ran R1-8B with both the anemll[0] and mlx[1][2] models on an M4 Max.
Prompt: "Tell me a long story about the origins of 42 being the answer."
anemll: 9.3 tok/sec, ~500MB of memory used.
mlx 8bit: 31.33 tok/sec, ~8.5GB of memory used.
mlx bf16: 27.17 tok/sec, ~15.7GB of memory used.
Memory results are from Activity Monitor, summed across any potentially involved processes, but I feel like I might be missing something here...
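For anyone who wants to reproduce the mlx side, here is a minimal sketch using the mlx-lm package (pip install mlx-lm). The exact repo id is a guess, since the links below are truncated; verbose=True prints prompt and generation tok/sec, and on older mlx versions the peak-memory call lives at mx.metal.get_peak_memory() instead.

```python
import mlx.core as mx
from mlx_lm import load, generate

# Repo id is an assumption based on the truncated link [1] below.
model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Llama-8B-8bit")

generate(
    model,
    tokenizer,
    prompt="Tell me a long story about the origins of 42 being the answer.",
    max_tokens=512,
    verbose=True,  # prints prompt/generation tok/sec stats
)

# Peak Metal allocation, as opposed to Activity Monitor's per-process view.
print(f"peak metal memory: {mx.get_peak_memory() / 2**30:.2f} GB")
```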
[0] https://huggingface.co/anemll/anemll-DeepSeekR1-8B-ctx1024_0...
[1] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...
[2] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...