There is a huge market for "it's faster" at the cost of efficiency, but I don't think ...

tgtweak • yesterday at 8:19 PM • 1 reply • view on HN

There is a huge market for "it's faster" at the cost of efficiency, but I don't think your claim holds that an EML hardware block would be inherently less efficient than the same workload running on a GPU. If you think it would be, back it up with some numbers.

A 10-stage EML pipeline would be about the size of an avx-512 instruction block on a modern CPU, in the realm of ~0.1 mm² on a 5nm process node (collectively including the FMA units behind it), in its entirety about 1% of the CPU die. None of this suggests that even a ~500-wide, 10-stage EML pipeline would consume anywhere near the power of a modern datacenter GPU (which wastes a lot of its energy moving things from memory to ALU to shader core...).

Not sure if you're arguing from a hypothetical position or a practical one, but you seem to be narrowing your argument to "well for simple math it's less efficient" but that's not the argument being made at all.

Replies

tripletao • yesterday at 9:37 PM

> you seem to be narrowing your argument to "well for simple math it's less efficient" but that's not the argument being made at all.

What? Unless the thing you want to compute happens to be exactly that eml() function (no multiplication, no addition, no subtraction unless it's an exponential minus a log, etc.) or almost so, it is unquestionably less efficient. If you believe otherwise, then please provide the eml() implementation of a practically useful function of your choice (e.g. that Arrhenius rate). Then we can count the superfluous transcendental function evaluations vs. a conventional implementation, and try to understand what benefit could outweigh them.
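
To make that counting concrete, here is a minimal sketch. It assumes eml(x, y) = exp(x) - ln(y), which is my reading of "an exponential minus a log"; the helper names (`exp_via_eml`, `ln_via_eml`, `count_evals`) are made up for illustration, and the ln construction is just one that works, not necessarily the shortest.

```python
import math

# Running tally of transcendental evaluations made through eml().
calls = {"exp": 0, "log": 0}

def eml(x, y):
    """Assumed definition: exp(x) - ln(y). Each call costs one exp and one log."""
    calls["exp"] += 1
    calls["log"] += 1
    return math.exp(x) - math.log(y)

def exp_via_eml(x):
    # exp(x) - ln(1) = exp(x); the log evaluation is pure overhead.
    return eml(x, 1.0)

def ln_via_eml(x):
    # eml(1, x)        = e - ln(x)
    # eml(e - ln x, 1) = exp(e - ln x) = e**e / x
    # eml(1, e**e / x) = e - (e - ln x) = ln(x)
    return eml(1.0, eml(eml(1.0, x), 1.0))

def count_evals(f, x):
    """Return (f(x), number of exp/log evaluations used to compute it)."""
    calls["exp"] = calls["log"] = 0
    value = f(x)
    return value, calls["exp"] + calls["log"]

if __name__ == "__main__":
    for name, f, ref, x in [
        ("exp", exp_via_eml, math.exp, 0.5),
        ("ln", ln_via_eml, math.log, 2.0),
    ]:
        value, n = count_evals(f, x)
        print(f"{name}({x}): eml-only={value!r} direct={ref(x)!r} evals={n}")
```

Under that assumption, a bare ln(x) already costs three eml() calls, i.e. six transcendental evaluations where a conventional implementation needs one. Anything involving a multiply or an add can only go up from there.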

> A 10-stage EML pipeline would be about the size of an avx-512 instruction block on a modern CPU

Can you explain where you got that conclusion? And what do you think a "10-stage EML pipeline" would be useful for? Remember that the multiply embedded in your Arrhenius rate is already 8 layers and 12 operations.

Also, can you confirm whether you're working with an LLM here? You're making a lot of unsupported and oddly specific claims that don't make sense to me, and I'm trying to understand where they're coming from.
