I wonder if Apple ever followed up with this: https://github.com/apple/ml-ane-transformers
They claim their ANE-optimized models achieve "up to 10 times faster and 14 times lower peak memory consumption compared to baseline implementations."
AFAIK, neither MLX nor llama.cpp supports the ANE, though llama.cpp has started exploring the idea [0].
What's weird is that MLX is made by Apple, and yet even they can't support the ANE, given its closed-source API! [1]
[0]: https://github.com/ggml-org/llama.cpp/issues/10453
[1]: https://github.com/ml-explore/mlx/issues/18#issuecomment-184...
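For reference, my understanding of the core trick in that repo (details may be off): reshape activations into the (B, C, 1, S) layout the ANE prefers, and replace nn.Linear with 1x1 nn.Conv2d. Roughly:

    import torch
    import torch.nn as nn

    # Sketch of the ane-transformers idea as I read it (not Apple's code):
    # the ANE likes a (batch, channels, 1, seq_len) layout, and 1x1 convs
    # map onto its hardware better than plain linear layers.
    class ANEFriendlyProjection(nn.Module):
        def __init__(self, dim_in: int, dim_out: int):
            super().__init__()
            # A 1x1 conv is mathematically a per-position linear layer.
            self.proj = nn.Conv2d(dim_in, dim_out, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, C, 1, S) rather than the usual (B, S, C)
            return self.proj(x)

    x = torch.randn(1, 512, 1, 128)            # (B, C, 1, S)
    out = ANEFriendlyProjection(512, 512)(x)   # still (B, C_out, 1, S)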
One update from Apple Research since then is "Deploying Attention-Based Vision Transformers to Apple Neural Engine" (though the relationship isn't clear; it doesn't build on ane_transformers, but maybe it's a sister project for vision?).
blog: https://machinelearning.apple.com/research/vision-transforme...
Not a public follow-up, but the iOS 17 speech-to-text model has a clever approach to KV caching that works within the ANE's constraints (fixed-size inputs).
I wrote about it here [0], but the gist is that you can have a fixed-size cache and slide it in chunks with each inference (minimal sketch below). Not as efficient as a cache that grows by one each time, of course.
[0]: https://stephenpanaro.com/blog/inside-apples-2023-transforme...
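A minimal sketch of the idea (my simplification in numpy, not the actual model code; cache length, chunk size, and dimensions are made up):

    import numpy as np

    # Fixed-size KV cache that slides in whole chunks, so every ANE
    # inference sees exactly the same tensor shapes.
    class SlidingKVCache:
        def __init__(self, cache_len: int = 448, dim: int = 512):
            self.k = np.zeros((cache_len, dim), dtype=np.float16)
            self.v = np.zeros((cache_len, dim), dtype=np.float16)

        def append_chunk(self, new_k: np.ndarray, new_v: np.ndarray):
            """Drop the oldest chunk, append the newest; shapes never change."""
            n = new_k.shape[0]
            self.k = np.concatenate([self.k[n:], new_k.astype(self.k.dtype)])
            self.v = np.concatenate([self.v[n:], new_v.astype(self.v.dtype)])
            return self.k, self.v

    # Each inference produces a chunk of keys/values:
    cache = SlidingKVCache()
    k_chunk = np.random.randn(64, 512).astype(np.float16)
    v_chunk = np.random.randn(64, 512).astype(np.float16)
    k_all, v_all = cache.append_chunk(k_chunk, v_chunk)  # always (448, 512)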
ONNX Runtime supports CoreML, though if my experience converting an embedding model to CoreML with Apple's conversion tool is anything like the ORT maintainers', I can see why that support would go unmaintained.
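(For anyone curious, the CoreML execution provider is opt-in when you create the session; the model path here is a placeholder:)

    import onnxruntime as ort

    # Request the CoreML EP, falling back to CPU for unsupported ops.
    sess = ort.InferenceSession(
        "model.onnx",  # placeholder path
        providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
    )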
It took multiple tries to get the model to convert to the mlpackage format at all, and then a lot of experimenting to get it to run on the ANE instead of the GPU. Even then, constant reshaping was killing any performance benefit (either you have a fixed multiplication size or don't bother), and even at a fixed size, using the attention mask, its operations were slower than saturating the GPU with large batches.
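Roughly the workflow, for anyone who wants to try it (toy model and shapes, not my actual one):

    import coremltools as ct
    import numpy as np
    import torch

    # Toy stand-in for the embedding model (illustrative only).
    class TinyEmbedder(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = torch.nn.Embedding(30522, 384)
            self.proj = torch.nn.Linear(384, 384)

        def forward(self, input_ids):
            return self.proj(self.emb(input_ids)).mean(dim=1)

    example = torch.zeros(1, 128, dtype=torch.long)
    traced = torch.jit.trace(TinyEmbedder().eval(), example)

    mlmodel = ct.convert(
        traced,
        convert_to="mlprogram",  # produces an .mlpackage
        # Fixed input shape: dynamic shapes meant constant reshaping on the ANE.
        inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
        # Exclude the GPU so CoreML schedules on CPU and/or the Neural Engine.
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )
    mlmodel.save("embedding.mlpackage")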
I also discovered an issue where targeting the newer iOS 18 standard would break the model conversion, and filed an issue on their GitHub, including an example repository for easy replication. I got a response quickly, but almost a year later the bug is still unfixed.
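(Assuming "the iOS 18 standard" maps to coremltools' deployment target, the knob that triggered it looks like this:)

    # Same conversion as the sketch above, but pinned to the newer target;
    # switching to this is what broke conversion for me.
    mlmodel = ct.convert(
        traced,  # the traced model from the previous sketch
        convert_to="mlprogram",
        minimum_deployment_target=ct.target.iOS18,
    )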
Even George Hotz, when he tried to hack around Apple's really bad and unmaintained CoreML library and drive the ANE directly, gave up because it was impossible without breaking some pretty core OS features (certificate signing, IIRC).
The ANE/CoreML team just isn't serious about making their hardware usable. Even Apple's internal MLX team can't crack that nut.
Based on the graphs, "up to 10 times faster" compares before/after flash attention.
This more than anything feels emblematic to me: Apple executives are brain-dead when it comes to software, and AI is seemingly a step too far (in the s/w direction). While they could at some level grok classical s/w, NNs are that terra incognita where Apple executives cannot possibly follow. It's just too strange and mysterious a world for them to effectively decide or execute on anything. I worked (20 yrs ago) for 3 years in an industrial R&D lab of a mostly h/w manufacturer, and it looked to me like the worlds of h/w and s/w, the mindsets, diverged pretty quickly on all the important considerations of "what should happen next".
Whisper.cpp has a CoreML option which, according to the docs, gives a 3x speed-up over CPU-only: https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#c...