> isn't a perfect fit for these algorithms but it's relatively close
I don't think that's true. It's perhaps the best fit out of what's presently available. Inference is almost entirely memory-bandwidth-bound at present, to the extent that GPUs with HBM have a massive advantage over those with GDDR. TPUs appear to be a much better overall design.
I expect that a hypothetical advance in fabrication enabling processing elements to be placed directly adjacent to dense RAM on the same silicon (not merely in the same package) would be superior in all regards.
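To put rough numbers on the bandwidth-bound claim, here's a back-of-the-envelope sketch. It assumes batch-size-1 decoding where every weight is read from memory once per generated token, and it ignores KV cache traffic and compute time entirely. The function name and the model size and bandwidth figures are made up for illustration, not measurements of any real part.

```python
def decode_tokens_per_second_upper_bound(param_count, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed if memory bandwidth is the only limit.

    Assumption: each generated token requires streaming all weights once.
    """
    model_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Hypothetical 70B-parameter model stored at 2 bytes per parameter.
    params = 70e9
    bytes_per_param = 2

    # Hypothetical memory systems (bytes per second), chosen only to show the ratio.
    for label, bandwidth in [("slower memory", 0.7e12), ("faster memory", 2.8e12)]:
        bound = decode_tokens_per_second_upper_bound(params, bytes_per_param, bandwidth)
        print(f"{label}: at most {bound:.1f} tokens/s")
```

Under these assumptions the bound scales linearly with bandwidth and doesn't depend on how many math units you have, which is the whole point: more FLOPS doesn't help until the weights can be fed faster.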
> I expect that a hypothetical advance in fabrication enabling processing elements to be placed directly adjacent to dense RAM on the same silicon (not merely in the same package) would be superior in all regards.
Processing scales better than DRAM does. I think an HBM-like stack where the bottom layer has the math units is probably the ultimate form of that.
And it's possible that flash instead of DRAM is actually the better play, as long as you can hook up enough in parallel. RIP Optane.
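As a rough sketch of what "enough in parallel" means: if you only care about aggregate read bandwidth, the device count is just a ceiling division. The function name and both bandwidth figures below are hypothetical placeholders, and this ignores latency, controller overhead, and anything else that would matter in a real design.

```python
def devices_needed(target_bandwidth, per_device_bandwidth):
    """Smallest number of parallel devices whose summed bandwidth meets the target.

    Both arguments are positive integers in the same unit (e.g. GB/s).
    """
    if target_bandwidth <= 0 or per_device_bandwidth <= 0:
        raise ValueError("bandwidths must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return -(-target_bandwidth // per_device_bandwidth)


if __name__ == "__main__":
    # Hypothetical figures in GB/s, for illustration only.
    target = 1000      # aggregate bandwidth we want to reach
    per_device = 7     # assumed sustained read bandwidth of one flash device
    print(devices_needed(target, per_device), "devices")
```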