Well… the reason there's such a big mismatch is the path through the memory controller: something like 80-90% of the energy is spent moving data in and out of DRAM rather than on the computation itself. If you move compute into the RAM and instead shuttle instructions in and out, you might get a huge speedup. The challenge is an instruction that references data sitting in a different bank or device - that remote access can end up eliminating the whole advantage. I believe people are trying to commercialize this concept (usually called processing-in-memory, or PIM).
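A back-of-envelope check on the "most of the energy is data movement" claim, using the approximate per-operation figures from Horowitz's ISSCC 2014 keynote ("Computing's Energy Problem", ~45 nm); the constants are order-of-magnitude illustrations, not measurements:

```python
# Rough per-operation energies at ~45 nm (Horowitz, ISSCC 2014).
# Treat these as order-of-magnitude, not exact.
ADD_32BIT_PJ  = 0.1     # 32-bit integer add
DRAM_32BIT_PJ = 640.0   # 32-bit read from off-chip DRAM

# Energy of one add whose two operands both come from DRAM:
total = 2 * DRAM_32BIT_PJ + ADD_32BIT_PJ
dram_share = 2 * DRAM_32BIT_PJ / total

print(f"DRAM read vs. add: {DRAM_32BIT_PJ / ADD_32BIT_PJ:.0f}x more energy")
print(f"share of energy spent moving data: {dram_share:.2%}")
```

With those numbers the data movement dwarfs the arithmetic by a factor of thousands, which is the asymmetry PIM designs try to exploit.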
> If you move compute into the RAM and instead shuttle instructions in and out, you might get a huge speed up.
Isn't that just per-core cache/local memory? You're describing a scaled-up variety of NUMA, where every compute core has its own local memory and going outside it costs you more.
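That NUMA framing can be made concrete with a toy cost model. Everything here is hypothetical for illustration (the latency constants are invented; real two-socket remote penalties are typically more like 1.5-2x local, while the in-memory-compute case above is far more lopsided):

```python
# Toy NUMA cost model: each core owns one memory node, and touching
# another core's node pays a remote penalty. Numbers are made up.
LOCAL_NS, REMOTE_NS = 80, 140   # hypothetical per-access latencies

def access_cost(accesses, home_node):
    """Total latency for a list of (node, count) accesses issued by a
    core whose local memory is home_node."""
    return sum(count * (LOCAL_NS if node == home_node else REMOTE_NS)
               for node, count in accesses)

# A core on node 0 that keeps 90% of its 1000 accesses local:
mostly_local = access_cost([(0, 900), (1, 100)], home_node=0)
# The same workload with the ratio inverted:
mostly_remote = access_cost([(0, 100), (1, 900)], home_node=0)
print(mostly_local, mostly_remote)
```

The point of the model: the whole scheme only wins if the placement keeps references local, which is exactly the "instruction references some data over there" problem from the parent comment.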