In fact, you can now! In the past week, FastFlowLM [0], a transformer inference framework for XDNA 2 NPUs, officially started supporting Linux.
I posted it here the same day I found and started using it, to almost no reaction.
[0] https://github.com/FastFlowLM https://fastflowlm.com/ https://huggingface.co/FastFlowLM
I see it making claims about 10x efficiency, but what's the actual tokens/second/watt? The only machines I've seen with the memory bandwidth to do local inference effectively are Apple M-series ARM chips in Macs.
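The metric is straightforward to compute once you have decode throughput and average package power; a minimal sketch, where all numbers are hypothetical placeholders rather than measured figures:

```python
# Back-of-envelope tokens/second/watt as an efficiency metric.
# Every number below is a hypothetical placeholder, not a benchmark result.

def tokens_per_second_per_watt(tokens: int, seconds: float, avg_watts: float) -> float:
    """Normalize decode throughput by average power draw during the run."""
    return (tokens / seconds) / avg_watts

# Hypothetical example: 500 tokens decoded in 25 s at ~10 W average:
npu = tokens_per_second_per_watt(500, 25.0, 10.0)  # 2.0 tok/s/W
# versus the same 500 tokens in 10 s at ~60 W:
gpu = tokens_per_second_per_watt(500, 10.0, 60.0)  # ~0.83 tok/s/W
print(npu, gpu)
```

The point of the normalization is that a slower accelerator can still win on this metric if its power draw is proportionally lower.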
Because it's not faster than the Ryzen 395's GPU. Power efficiency doesn't matter as much as time to first token (TTFT) for desktop users, especially when they're using their AMD box as a dedicated inference machine.
Some older pre-395 AMD articles suggested it would be possible to use the NPU for prefill and the GPU for decoding, and that this would be faster than using either alone, but we have yet to see that (even on Windows) for any usefully sized model, just toys like LLaMA-8B.
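The proposed split can be sketched in a few lines; `run_prefill_on_npu` and `decode_step_on_gpu` are hypothetical stand-ins stubbed with toy logic, since (as noted above) no shipping framework exposes this split today:

```python
# Conceptual sketch of the NPU-prefill / GPU-decode split.
# Both device functions are hypothetical stubs, not a real framework API.

def run_prefill_on_npu(prompt_tokens):
    # Hypothetical: NPU processes the whole prompt in one pass and
    # returns the KV cache the decoder needs. Stubbed as a plain list.
    return list(prompt_tokens)

def decode_step_on_gpu(kv_cache, last_token):
    # Hypothetical: GPU generates one token against the shared KV cache.
    kv_cache.append(last_token)
    return last_token + 1  # toy "model": next token is last + 1

def generate(prompt_tokens, n_new):
    kv = run_prefill_on_npu(prompt_tokens)  # TTFT cost lands here
    out = []
    tok = prompt_tokens[-1]
    for _ in range(n_new):
        tok = decode_step_on_gpu(kv, tok)   # per-token decode cost here
        out.append(tok)
    return out

print(generate([1, 2, 3], 4))  # [4, 5, 6, 7]
```

The appeal of the split is that prefill is compute-bound (where an NPU's dense matmul throughput helps TTFT) while decode is memory-bandwidth-bound (where the GPU's faster memory helps tokens/second); the hard part in practice is sharing the KV cache between the two devices.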
> to almost no reaction.
HN is overloaded with AI stuff; it's hard to break through all the noise. I say this as someone very interested in AI. Even I skip some links because it's just too much.