The README lacks the most important thing: how many more tokens/sec at the same quantization, compared to llama.cpp / MLX? It's only worth switching default platforms if there's a major improvement.
Yeah, I looked all over for a comparison and couldn't find anything in the repo, on their social media, etc. I saw some other comments here that said it's supposed to be "15.8 fp16 ops compared to 14.7 fp32 ops" but that isn't really enough to go on. Maybe when I have the time I'll install their TestFlight app and do some comparisons myself.
I ran R1-8B with both the anemll[0] and mlx[1][2] models on an M4 Max.
Prompt: "Tell me a long story about the origins of 42 being the answer."
anemll: 9.3 tok/sec, ~500MB of memory used.
mlx 8bit: 31.33 tok/sec, ~8.5GB of memory used.
mlx bf16: 27.17 tok/sec, ~15.7GB of memory used.
Memory results are from Activity Monitor, summed across any potentially involved processes, but I feel like I might be missing something here...
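If anyone wants to sanity-check the mlx side, something like the following should reproduce those numbers. This is just a sketch: the repo id is my guess at the truncated [1] link below, so adjust it to whichever quant you pull.

    # pip install mlx-lm
    from mlx_lm import load, generate

    # Repo id is a guess at the truncated [1] link below.
    model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Llama-8B-8bit")

    prompt = "Tell me a long story about the origins of 42 being the answer."
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    # verbose=True prints prompt/generation tok/sec (and, in recent
    # mlx-lm versions, peak memory).
    text = generate(model, tokenizer, prompt=prompt, verbose=True)

Note that the peak-memory figure mlx reports and what Activity Monitor shows won't necessarily match, since wired/GPU memory gets accounted differently, which may explain some of the gap above.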
[0] https://huggingface.co/anemll/anemll-DeepSeekR1-8B-ctx1024_0...
[1] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...
[2] https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Lla...