Run LLMs on Apple Neural Engine (ANE)

247 points by behnamoh yesterday at 3:29 PM | 100 comments

Comments

behnamoh yesterday at 4:15 PM

I wonder if Apple ever followed up with this: https://github.com/apple/ml-ane-transformers

They claim their ANE-optimized models achieve "up to 10 times faster and 14 times lower peak memory consumption compared to baseline implementations."

AFAIK, neither MLX nor llama.cpp supports the ANE, though llama.cpp has started exploring the idea [0].

What's weird is that MLX is made by Apple, and yet it can't support the ANE because the ANE API is closed-source! [1]

[0]: https://github.com/ggml-org/llama.cpp/issues/10453

[1]: https://github.com/ml-explore/mlx/issues/18#issuecomment-184...
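
As far as I can tell, the only sanctioned route to the ANE today is Core ML: you convert the model with coremltools and request the Neural Engine via the compute-units setting, and the closed-source runtime decides per-op what actually lands on the ANE. A minimal sketch with a toy module (not taken from either repo):

    import torch
    import coremltools as ct

    # Toy stand-in for a real transformer block.
    class TinyMLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(512, 512)

        def forward(self, x):
            return torch.relu(self.fc(x))

    example = torch.zeros(1, 512)
    traced = torch.jit.trace(TinyMLP().eval(), example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="x", shape=example.shape)],
        compute_units=ct.ComputeUnit.CPU_AND_NE,   # ask for CPU + Neural Engine
        minimum_deployment_target=ct.target.iOS16,
    )
    mlmodel.save("tiny_mlp.mlpackage")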

cowmix yesterday at 5:00 PM

This sorta reminds me of the lie that was pushed when the Snapdragon X laptops were released last year. Qualcomm implied the NPU would be used for LLMs — and I bought into the BS without looking into it. I still use a Snapdragon laptop as my daily driver (it's fine), but for running models locally it's still a joke. Despite Qualcomm's claims about running 13B-parameter models, software like LM Studio only runs on the CPU, with NPU support merely "planned for future updates" (per XDA). The NPU isn't even faster than the CPU for LLMs — it's just more power-efficient, and only for small models, not the big ones people actually want to run. Their GPUs aren't much better for this purpose either. The only hope for LLMs is the Vulkan support on the Snapdragon X, which is still half-baked.

htk yesterday at 4:03 PM

I always felt that the Neural Engine was wasted silicon; they could have added more GPU cores in that die space and redirected the neural-processing API to the GPU as needed. But I'm no expert, so if anyone here has a different opinion I'd love to learn from it.

antirez yesterday at 5:43 PM

The README lacks the most important thing: how many more tokens/sec at the same quantization, compared to llama.cpp / MLX? It is only worth switching default platforms if there is a major improvement.
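
Roughly the measurement being asked for, with generate() as a hypothetical stand-in for whichever backend is under test (this project, llama.cpp, or MLX), same prompt and same quantization for each:

    import time

    def tokens_per_sec(generate, prompt, n_tokens=128):
        generate(prompt, max_tokens=8)           # warm-up: load weights, fill caches
        start = time.perf_counter()
        generate(prompt, max_tokens=n_tokens)    # timed decode of a fixed token budget
        return n_tokens / (time.perf_counter() - start)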

simonw yesterday at 4:04 PM

I'm trying to figure out what the secret sauce for this is. It depends on https://github.com/apple/coremltools - is that the key trick or are there other important techniques going on here?
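
As far as I can tell, coremltools is most of it: conversion plus the compute-units setting is the only public lever for the ANE, so any "secret sauce" would be in how the model is rewritten (op choices, tensor layouts, chunking) so Core ML doesn't silently fall back to CPU/GPU. The runtime side is just something like this (hypothetical .mlpackage path):

    import numpy as np
    import coremltools as ct

    # Load a converted model and request the Neural Engine; Core ML may still
    # fall back to CPU/GPU for ops the ANE doesn't support.
    model = ct.models.MLModel(
        "model.mlpackage",                       # hypothetical converted model
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )
    out = model.predict({"x": np.zeros((1, 512), dtype=np.float32)})
    print({name: value.shape for name, value in out.items()})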

daedrdev yesterday at 5:39 PM

Apple is a competitive choice simply because its unified memory lets you get enough RAM to run larger models that would otherwise need multiple GPUs to fit.
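
A rough weights-only estimate (ignoring KV cache and runtime overhead) shows why:

    # Back-of-the-envelope: memory needed just for the weights.
    def weight_gb(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for params, bits in [(8, 4), (70, 4), (70, 8)]:
        print(f"{params}B @ {bits}-bit ~ {weight_gb(params, bits):.0f} GB")
    # 70B @ 4-bit is ~35 GB of weights: it fits in a 64 GB Mac's unified memory,
    # but needs at least two 24 GB discrete GPUs to hold.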

jvstokes yesterday at 7:03 PM

But coreml utilizes ANE, right? Is there some bottleneck in coreml that requires lower level access?

gitroom yesterday at 7:06 PM

Man, Apple's tight grip on the ANE is kinda nuts - would love to see the day they let folks get real hands-on. Do you ever think companies hold stuff back just to keep control, or is there actually some big technical reason for it?

cwoolfe yesterday at 5:46 PM

Is there a performance benefit for inference speed on M-series MacBooks, or is the primary task here simply to get inference working on other platforms (like iOS)? If there is a performance benefit, it would be great to see tokens/s of this vs. Ollama.

kamranjon yesterday at 4:07 PM

I am curious whether anyone knows if the neural cores in Apple Silicon machines are at all useful for training. I've been using the MLX framework but haven't seen them mentioned anywhere, so I'm wondering if they are only useful for inference. I know whisper.cpp takes advantage of them in the inference context.

Edit: I changed llama.cpp to whisper.cpp - I didn’t realize that llama.cpp doesn’t have a coreml option like whisper.cpp does.

randmeerkat today at 2:11 AM

Yet Siri is still dumber than a doorknob…

neves today at 3:17 AM

What about Microsoft Copilot AI notebooks? Would they run anything quickly enough to be useful?

m3kw9 yesterday at 6:30 PM

Even a 1 GB model is prohibitively big for phones if you want mass adoption.

xyst yesterday at 6:55 PM

Getting anything to work on Apple proprietary junk is such a chore.

neuroelectron yesterday at 5:56 PM

btw, don't bother trying to buy a bunch of Mac boxes to run LLMs in parallel because it won't be any faster than a single box.
