I'm trying to figure out what the secret sauce for this is. It depends on https://github.com/apple/coremltools - is that the key trick or are there other important techniques going on here?
coremltools is the only way to run on the ANE, so it's less of a trick and more of a requirement.
The tricks are more around optimizing for the hardware capabilities/constraints. For instance:
- conv2d is faster than linear (see Apple's post [0]), so you rewrite the model to use it (example from the repo [1])
- inputs/outputs are static shapes, so the KV cache requires some creativity (I wrote about that here [2])
- compute is float16 (not bfloat16), so occasionally you have to avoid activation overflows
[0]: https://machinelearning.apple.com/research/neural-engine-tra...
[1]: https://github.com/Anemll/Anemll/blob/4bfa0b08183a437e759798...
[2]: https://stephenpanaro.com/blog/kv-cache-for-neural-engine
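The linear→conv2d rewrite is mechanical once you see the shape trick. A rough sketch in PyTorch (not code from the repo, just an illustration of the equivalence): a `Linear(d_in, d_out)` applied over a sequence is the same computation as a 1x1 `Conv2d` applied to the tensor laid out channels-first, which is the layout the ANE prefers.

```python
import torch

# A Linear(d_in, d_out) on a (batch, seq, d_in) tensor is mathematically
# equivalent to a 1x1 Conv2d on a (batch, d_in, 1, seq) tensor.
d_in, d_out = 8, 16
linear = torch.nn.Linear(d_in, d_out, bias=True)
conv = torch.nn.Conv2d(d_in, d_out, kernel_size=1, bias=True)

# Reuse the linear weights: Conv2d weight shape is (out, in, 1, 1).
conv.weight.data = linear.weight.data.view(d_out, d_in, 1, 1)
conv.bias.data = linear.bias.data

x = torch.randn(2, 5, d_in)                       # (batch, seq, d_in)
y_lin = linear(x)                                 # (batch, seq, d_out)

y_conv = conv(x.transpose(1, 2).unsqueeze(2))     # (batch, d_in, 1, seq) in
y_conv = y_conv.squeeze(2).transpose(1, 2)        # back to (batch, seq, d_out)

assert torch.allclose(y_lin, y_conv, atol=1e-5)   # same result, ANE-friendly layout
```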
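For the float16 point: bfloat16 keeps float32's exponent range, but float16 tops out at 65504, so intermediate activations that are harmless in bfloat16 can overflow to inf. A toy sketch of the kind of fix involved (the classic max-subtraction softmax trick, not any specific code from the repo):

```python
import torch

# float16 overflows past 65504; exp(90) is ~1.2e39, so a naive softmax
# numerator blows up where bfloat16 would have been fine.
x = torch.tensor([12.0, 90.0], dtype=torch.float16)
naive = torch.exp(x)
assert torch.isinf(naive[1])          # overflowed to inf

# Standard mitigation: subtract the max before exponentiating.
stable = torch.exp(x - x.max())
assert torch.isfinite(stable).all()   # all values now in float16 range
```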