This sorta reminds me of the lie that was pushed when the Snapdragon X laptops were released last year. Qualcomm implied the NPU would be used for LLMs, and I bought into the BS without looking into it. I still use a Snapdragon laptop as my daily driver (it's fine), but for running models locally it's still a joke. Despite Qualcomm's claims about running 13B-parameter models, software like LM Studio only runs on the CPU, with NPU support merely "planned for future updates" (per XDA). The NPU isn't even faster than the CPU for LLMs; it's just more power-efficient for small models, not the big ones people actually want to run. Their GPUs aren't much better for this purpose either. The only hope for LLMs is the Vulkan support on the Snapdragon X, and even that is still half-baked.
I always felt that the Neural Engine was wasted silicon; they could add more GPU cores in that die space and redirect the neural-processing API to the GPU as needed (see the sketch below). But I'm no expert, so if anyone here has a different opinion I'd love to learn from it.
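For what it's worth, Core ML already exposes a knob for exactly that kind of redirection: you can tell it which compute units a model may use when you load it. A minimal sketch with coremltools ("model.mlpackage" is just a placeholder path for any converted model):

```python
import coremltools as ct

# Steer a Core ML model away from the ANE and onto the GPU (or CPU) at load time.
mlmodel = ct.models.MLModel(
    "model.mlpackage",                          # placeholder: any converted model
    compute_units=ct.ComputeUnit.CPU_AND_GPU,   # skip the ANE entirely
)
# Other options: ct.ComputeUnit.ALL, CPU_ONLY, CPU_AND_NE
```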
The README lacks the most important thing: how many more tokens/sec does this get at the same quantization compared to llama.cpp / MLX? Switching default platforms is only worth it if there is a major improvement.
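Even a rough apples-to-apples timing would help: same prompt, same quant, for each backend. Something like the sketch below, where `generate` is a hypothetical stand-in for whichever runtime you're timing (this project, llama.cpp bindings, MLX):

```python
import time

def tokens_per_second(generate, prompt: str, max_tokens: int = 256) -> float:
    """Time a single generation and return wall-clock tokens/s.

    `generate` is a placeholder: any callable that runs the backend under test
    and returns the number of tokens it actually produced.
    """
    start = time.perf_counter()
    n_generated = generate(prompt, max_tokens)
    return n_generated / (time.perf_counter() - start)
```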
I'm trying to figure out what the secret sauce for this is. It depends on https://github.com/apple/coremltools - is that the key trick or are there other important techniques going on here?
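From a skim, coremltools does look like the main ingredient, since converting to a Core ML "ML Program" is the only sanctioned route onto the ANE. A hedged sketch of what that conversion step generally looks like (this is generic coremltools usage, not this project's actual code):

```python
import torch
import coremltools as ct

# Generic PyTorch -> Core ML conversion; Core ML then decides whether the
# resulting ML Program runs on the CPU, GPU, or ANE.
block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
example = torch.randn(1, 512)
traced = torch.jit.trace(block, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.macOS13,
)
mlmodel.save("block.mlpackage")
```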
Apple is a competitive choice simply because unified memory lets you get enough RAM to run larger models that would otherwise take multiple GPUs' worth of VRAM.
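The arithmetic is simple enough to sketch (rough numbers, weights only, ignoring KV cache and runtime overhead):

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    # weights only; KV cache and activations come on top of this
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(70, 4))   # ~35 GB: fits in a 64 GB Mac, needs several 24 GB GPUs
print(weight_gb(70, 16))  # ~140 GB: only the largest unified-memory configs
```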
But Core ML utilizes the ANE, right? Is there some bottleneck in Core ML that requires lower-level access?
Man, Apple's tight grip on the ANE is kinda nuts - would love to see the day they let folks get real hands-on. You ever think companies hold stuff back just to keep control, or is there actually some big technical reason for it?
Is there a performance benefit for inference speed on M-series MacBooks, or is the primary task here simply to get inference working on other platforms (like iOS)? If there is a performance benefit, it would be great to see tokens/s of this vs. Ollama.
I am curious if anyone knows whether the neural cores in Apple silicon machines are at all useful for training? I've been using the MLX framework but haven't seen them mentioned anywhere, so I'm wondering if they are only useful for inference. I know whisper.cpp takes advantage of them in the inference context.
Edit: I changed llama.cpp to whisper.cpp - I didn't realize that llama.cpp doesn't have a Core ML option like whisper.cpp does.
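As far as I can tell the ANE is inference-only from a third-party point of view; MLX dispatches everything, gradients included, to the Metal GPU (or CPU) and has no ANE backend. A quick sketch of what I mean:

```python
import mlx.core as mx

print(mx.default_device())  # Device(gpu, 0): the Metal GPU, not the neural cores

def loss(w, x, y):
    return mx.mean((x @ w - y) ** 2)

w = mx.zeros((8, 1))
x = mx.random.normal((32, 8))
y = mx.random.normal((32, 1))
grad = mx.grad(loss)(w, x, y)  # the gradient is computed on the GPU as well
print(grad.shape)
```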
Yet Siri is still dumber than a doorknob…
What about Microsoft Copilot AI notebooks? Would they run anything quickly enough to be useful?
Even a 1 GB model is prohibitively big for phones if you want mass adoption.
Getting anything to work on Apple's proprietary junk is such a chore.
btw, don't bother buying a bunch of Mac boxes to run LLMs in parallel; for a single generation stream it won't be any faster than one box, since token generation is bound by memory bandwidth. Clustering mostly just lets you fit larger models.
I wonder if Apple ever followed up with this: https://github.com/apple/ml-ane-transformers
They claim their ANE-optimized models achieve "up to 10 times faster and 14 times lower peak memory consumption compared to baseline implementations."
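The core trick in that repo, as I understand it, is a data-layout change: keep activations as (B, C, 1, S) channels-first tensors and replace linear layers with 1x1 convolutions, which the ANE's hardware handles well and which avoids costly transposes. A rough paraphrase of the idea (not their actual code):

```python
import torch
import torch.nn as nn

class ANEFriendlyFFN(nn.Module):
    """Feed-forward block in the ANE-preferred (B, C, 1, S) layout."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # 1x1 conv2d stands in for nn.Linear on channels-first tensors
        self.up = nn.Conv2d(d_model, d_ff, kernel_size=1)
        self.down = nn.Conv2d(d_ff, d_model, kernel_size=1)

    def forward(self, x):  # x: (batch, d_model, 1, seq_len)
        return self.down(torch.relu(self.up(x)))

x = torch.randn(1, 512, 1, 128)                 # (B, C, 1, S) instead of (B, S, C)
print(ANEFriendlyFFN(512, 2048)(x).shape)       # torch.Size([1, 512, 1, 128])
```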
AFAIK, neither MLX nor llama.cpp supports the ANE, though llama.cpp has started exploring the idea [0].
What's weird is that MLX is made by Apple, and yet even they can't support the ANE because its API is closed! [1]
[0]: https://github.com/ggml-org/llama.cpp/issues/10453
[1]: https://github.com/ml-explore/mlx/issues/18#issuecomment-184...