This could have some interesting hardware implications as well - it suggests that a dedicated silicon instruction set could accelerate any mathematical algorithm, provided it can be mapped to this primitive. It also suggests a compiler/translation layer should be possible, as well as some novel visualization methods for functions.
This paper seems to suggest that a chip with 10 pipeline stages of EML units could evaluate any elementary function (table 4) in a single pass.
I'm curious how this would compare to the dedicated SSE or XMX instructions currently inside most processors' instruction sets.
Lastly, you could also create a 5- or 6-depth EML tree in hardware (FPGA, most likely) and use it in lieu of the Rust implementation to discover weight-optimal EML formulas for input functions much more quickly. Those formulas could then feed into a "compiler" that would let them run on a similar-scale interpreter on the same silicon.
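Just to make the search idea concrete, here's a toy software version of that brute-force loop. Everything in it is a placeholder of my own - the `Expr` type, the add/mul primitive set over `{x, 1.0, 0.5}`, and the `exp(x)` target are stand-ins for whatever the paper's actual EML primitive and weight space look like - but it shows the shape of "enumerate all depth-bounded trees, score each against a target function, keep the best":

```rust
// Toy sketch: enumerate all expression trees up to a fixed depth and keep
// the one that best fits a target function on a grid. NOTE: the primitive
// set here (add/mul over {x, 1, 0.5}) is a stand-in -- the paper's actual
// EML primitive would replace it, and a real search would also optimize
// the weights rather than fixing the leaf constants.

#[derive(Clone, Debug)]
enum Expr {
    X,
    Const(f64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

impl Expr {
    fn eval(&self, x: f64) -> f64 {
        match self {
            Expr::X => x,
            Expr::Const(c) => *c,
            Expr::Add(a, b) => a.eval(x) + b.eval(x),
            Expr::Mul(a, b) => a.eval(x) * b.eval(x),
        }
    }
}

// All trees of depth <= d over the toy primitive set.
fn enumerate(d: u32) -> Vec<Expr> {
    if d == 0 {
        return vec![Expr::X, Expr::Const(1.0), Expr::Const(0.5)];
    }
    let sub = enumerate(d - 1);
    let mut out = sub.clone();
    for a in &sub {
        for b in &sub {
            out.push(Expr::Add(Box::new(a.clone()), Box::new(b.clone())));
            out.push(Expr::Mul(Box::new(a.clone()), Box::new(b.clone())));
        }
    }
    out
}

// Best depth-<=d tree for exp(x) on [0, 1], scored by max abs error on a grid.
fn search(d: u32) -> (f64, Expr) {
    let grid: Vec<f64> = (0..=20).map(|i| i as f64 / 20.0).collect();
    let mut best: Option<(f64, Expr)> = None;
    for e in enumerate(d) {
        let err = grid
            .iter()
            .map(|&x| (e.eval(x) - x.exp()).abs())
            .fold(0.0_f64, f64::max);
        if best.as_ref().map_or(true, |(b, _)| err < *b) {
            best = Some((err, e));
        }
    }
    best.unwrap()
}

fn main() {
    let (err, e) = search(2);
    println!("best depth-2 tree: {:?}, max abs error = {:.4}", e, err);
}
```

Even this toy version makes the motivation for hardware obvious: the candidate count explodes combinatorially with depth (each level roughly squares it), so a software loop chokes well before depth 5 or 6, while an FPGA could score huge batches of candidate trees in parallel.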
In simple terms: you can imagine an EML co-processor sitting alongside a CPU's standard math coprocessor(s). XMX, SSE, and AMX would do the multiplication/tile math they're optimized for, then hand exp, sin, and log calls to the EML coprocessor, which would reconfigure its EML trees internally to process them at single-cycle speed instead of relaying them back to the main CPU to do that math in generalized instructions - likely something that takes many cycles.
I'm not too familiar with the hardware world, but does EML look like the kind of computation that's hardware-friendly? Would love for someone with more expertise to chime in here.