Hacker News

htk · yesterday at 4:03 PM

I always felt that the Neural Engine was wasted silicon: they could add more GPU cores in that die space and redirect the neural-processing API to the GPU as needed. But I'm no expert, so if anyone here has a different opinion, I'd love to learn from it.


Replies

lucasoshiro · yesterday at 4:24 PM

I'm not an ML guy, but when I needed to train a NN I thought that my Mac's ANE would help. But actually, despite it being way easier to set up tensorflow + metal + M1 on Mac than to set up tensorflow + cuda + nvidia on Linux, the neural engine cores are not used. Not even for classification, which is their main purpose. I wouldn't say they are wasted silicon, but they are way less useful than you'd expect.
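
For anyone curious, a minimal sketch of what that tensorflow-metal path looks like (assuming `pip install tensorflow tensorflow-metal` on Apple Silicon): the Metal plugin only ever registers the GPU as a device, so nothing in TensorFlow touches the ANE.

    # Minimal sketch, assuming `pip install tensorflow tensorflow-metal`
    # on an Apple Silicon Mac.
    import tensorflow as tf

    # The Metal plugin registers the GPU only; no ANE device ever appears,
    # so TF training/inference runs on the CPU or GPU cores.
    print(tf.config.list_physical_devices())  # e.g. [CPU, GPU] -- no ANE entry

    with tf.device("/GPU:0"):
        a = tf.random.normal((2048, 2048))
        b = tf.random.normal((2048, 2048))
        c = tf.matmul(a, b)                    # dispatched to the GPU via Metal
    print(c.device)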

brigade · yesterday at 6:10 PM

Eyeballing 3rd-party annotated die shots [1], it's about the size of two GPU cores, but achieves 15.8 TFLOPS, which is more than the reported 14.7 TFLOPS of the 32-core GPU in the binned M4 Max.

[1] https://vengineer.hatenablog.com/entry/2024/10/13/080000
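
Back-of-the-envelope with those figures (taking both numbers at face value, and ignoring precision and utilization differences between the two units):

    # Rough per-area comparison using the figures quoted above
    gpu_tflops = 14.7              # binned M4 Max, 32 GPU cores
    per_core = gpu_tflops / 32     # ~0.46 TFLOPS per GPU core
    two_cores = 2 * per_core       # ~0.92 TFLOPS in roughly the ANE's die area
    ane_tflops = 15.8
    print(ane_tflops / two_cores)  # ~17x the throughput for the same area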

xiphias2 · yesterday at 5:19 PM

I guess it's a hard choice, as it's about 5x more energy efficient than the GPU because it uses a systolic array.

For laptops, 2x GPU cores would make more sense; for phones and tablets, energy efficiency is everything.
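
For reference, a systolic array streams operands between neighboring multiply-accumulate units instead of refetching them from registers or SRAM for every product, which is where most of the energy win comes from. A toy simulation of the output-stationary dataflow (purely illustrative; not a description of how the ANE is actually documented to work):

    import numpy as np

    def systolic_matmul(A, B):
        # Toy output-stationary systolic array: A streams in from the left,
        # B from the top, and PE (i, j) accumulates C[i, j] locally as the
        # skewed wavefronts pass through. Each operand is fetched once and
        # then reused across an entire row/column of PEs.
        m, k = A.shape
        _, n = B.shape
        C = np.zeros((m, n))
        for cycle in range(m + n + k - 2):      # skew + drain time
            for i in range(m):
                for j in range(n):
                    t = cycle - i - j           # operand pair reaching PE (i, j)
                    if 0 <= t < k:
                        C[i, j] += A[i, t] * B[t, j]
        return C

    A = np.random.rand(4, 6)
    B = np.random.rand(6, 5)
    assert np.allclose(systolic_matmul(A, B), A @ B)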

1W6MIC49CYX9GAP · yesterday at 5:16 PM

You're completely right: if you already have a GPU in a system, adding tensor cores to it gives you better performance per area.

GPU + dedicated AI hardware is virtually always the wrong approach compared to GPU + tensor cores.

ks2048 · yesterday at 5:15 PM

At least one benchmark I saw said the ANE can be 7x faster than the GPU (Metal / MPS):

https://discuss.pytorch.org/t/apple-neural-engine-ane-instea...

It seems intuitive that if they design hardware very specifically for these applications (beyond just fast matmuls on a GPU), they could squeeze out more performance.
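
A rough way to reproduce that kind of comparison from Python (a sketch, not a rigorous benchmark: the model, shapes, and iteration counts are made up, and requesting `CPU_AND_NE` only makes the ANE eligible; Core ML still decides what actually runs there):

    import time
    import torch
    import coremltools as ct

    # Toy model just for illustration
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).eval()
    x = torch.randn(1, 1024)

    # Path 1: Core ML with the GPU excluded, so eligible ops can go to the ANE
    traced = torch.jit.trace(model, x)
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="x", shape=x.shape)],
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )

    # Path 2: PyTorch's MPS backend on the GPU
    mps_model = model.to("mps")
    x_mps = x.to("mps")

    def bench(fn, n=200):
        fn()                                   # warm-up
        t0 = time.perf_counter()
        for _ in range(n):
            fn()
        return (time.perf_counter() - t0) / n

    print("Core ML (CPU+ANE):", bench(lambda: mlmodel.predict({"x": x.numpy()})))
    print("PyTorch (MPS GPU):", bench(lambda: (mps_model(x_mps), torch.mps.synchronize())))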

rz2k · yesterday at 7:17 PM

I was trying to figure the same thing out a couple months ago, and didn't find much information.

It looked like even ANEMLL provides limited low-level access for directing processing specifically toward the Apple Neural Engine, because Core ML still acts as the orchestrator. Instead, flags during conversion of a PyTorch or TensorFlow model can specify ANE-friendly operations, quantization, and hints about compute targets or optimization strategies, and at load time `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine` excludes the GPU cores.
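
Roughly, the Python-side (coremltools) equivalents of those conversion knobs look like this (a sketch with a made-up stand-in model; the filename is also hypothetical):

    import torch
    import coremltools as ct

    # Tiny stand-in model; any traceable PyTorch module works the same way.
    model = torch.nn.Linear(256, 256).eval()
    example = torch.randn(1, 256)
    traced = torch.jit.trace(model, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="x", shape=example.shape)],
        convert_to="mlprogram",                    # ML Program backend
        compute_precision=ct.precision.FLOAT16,    # the ANE runs FP16
        compute_units=ct.ComputeUnit.CPU_AND_NE,   # exclude the GPU entirely
    )
    # Other compute_units options: ALL, CPU_ONLY, CPU_AND_GPU. Even with
    # CPU_AND_NE, Core ML's partitioner still decides per-op whether an
    # operation is ANE-eligible or falls back to the CPU.
    mlmodel.save("tiny_ane_model.mlpackage")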

Anyway, I didn't actually experiment with this, but at the time I thought there might be a strategy of building a speculative decoding setup, with a small ANE-compatible model acting as the draft model paired with a larger target model running on the GPU cores, the idea being that the ANE's low latency and high efficiency could accelerate results.
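
For what it's worth, the control flow of that draft/verify idea looks something like this. It's a toy, model-agnostic sketch of speculative sampling: the two probability callables are hypothetical stand-ins for an ANE-hosted draft model and a GPU-hosted target model, and nothing here is Apple-specific.

    import numpy as np

    def speculative_decode(draft_next_probs, target_next_probs, prompt, k=4, steps=32):
        # draft_next_probs / target_next_probs: callables mapping a token
        # sequence to a next-token probability distribution.
        rng = np.random.default_rng(0)
        tokens = list(prompt)
        while len(tokens) < len(prompt) + steps:
            # 1. Draft model proposes k tokens cheaply (would run on the ANE).
            proposal, q = [], []
            for _ in range(k):
                p = draft_next_probs(tokens + proposal)
                t = int(rng.choice(len(p), p=p))
                proposal.append(t)
                q.append(p)
            # 2. Target model verifies the proposals (would run on the GPU,
            #    ideally scoring all k positions in one batched pass).
            accepted = 0
            for i, t in enumerate(proposal):
                p = target_next_probs(tokens + proposal[:i])
                if rng.random() < min(1.0, p[t] / q[i][t]):
                    tokens.append(t)
                    accepted += 1
                else:
                    # Reject: resample from the corrected residual distribution.
                    residual = np.maximum(p - q[i], 0)
                    residual /= residual.sum()
                    tokens.append(int(rng.choice(len(residual), p=residual)))
                    break
            if accepted == k:
                # All drafts accepted: take one bonus token from the target.
                p = target_next_probs(tokens)
                tokens.append(int(rng.choice(len(p), p=p)))
        return tokens

    # e.g. with dummy uniform "models":
    V = 16
    uniform = lambda toks: np.full(V, 1.0 / V)
    print(speculative_decode(uniform, uniform, prompt=[0, 1, 2], k=4, steps=8))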

However, I would be interested to hear the perspective of people who actually know something about the subject.

bigyabai · yesterday at 4:13 PM

If you did that, you'd stumble into the Apple GPU's lack of tensor acceleration hardware. For an Nvidia-like experience you'd have to re-architect the GPU to subsume the NPU's role, and if that were easy, everyone would have done it by now.
