For basic arithmetic, this is not required nor would it be faster, as it is not likely advantageous ...

tgtweak • yesterday at 5:31 PM • 1 reply • view on HN

For basic arithmetic, this is not required nor would it be faster, as it is not likely advantageous for bulk static transcendal functions. Where this becomes interesting is when combining them OR when chaining them where today they must come back out to the main process for reconfiguration and then re-issued.

Practical terms: Jacobian (heavily used in weather and combustion simulation): The transcendental calls, mostly exp(-E_a/RT), are the actual clock-cycle bottleneck. The GPU's SFU computes one exp2 at a time per SM. The ALU then has to convert it (exp(x) = exp2(x × log2(e))), multiply by the pre-exponential factor, and accumulate partial derivatives. It's a long serial chain for each reaction rate.

The core of this is the Arrhenius rate, (A × T^n × exp(-E_a/(R×T))), which involves an exponentiation, a division, a multiplication, and an exponential. On a GPU, that's multiple SFU calls chained with ALU ops. In an EML tree, the whole expression compiles to a single tree that flows through the pipeline in one pass.

GPU (PreJacGPU) is currently the state of the art for speed on these simulations - a moderate width 8-depth EML machine could process a very complex Jacobian as fast as the gpu can evaluate one exp(). Even on a sub-optimal 250mhz FPGA, an entire 50x50 Jacobian would be about 3.5 microseconds vs 50 microseconds PER Jacobian on an A100.

If you put that same logic path into an ASIC, you'd be about 20x the fPGA's speed - in the nanoseconds per round. And this is not like you're building one function into an ASIC it's general purpose. You just feed it a compiled tree configuration and run your data through it.

For anything like linear algebra math, which is also used here, you'd delegate that to the dedicated math functions on the processor - it wouldn't make sense to do those in this.

Replies

tripletao • yesterday at 7:12 PM

> The core of this is the Arrhenius rate, (A × T^n × exp(-E_a/(R×T))), which involves an exponentiation, a division, a multiplication, and an exponential. On a GPU, that's multiple SFU calls chained with ALU ops. In an EML tree, the whole expression compiles to a single tree that flows through the pipeline in one pass.

I think you're missing the reason why the GPU kicks you out of the fast path when you need that special function. The special function evaluation is fundamentally more expensive in energy, whether that cost is paid in area or time. Evaluation of the special functions with throughput similar to the arithmetic throughput would require much more area for the special functions, which for most computation isn't a good tradeoff. That's why the GPU's designers chose to make your exp2 slow.

Replacing everything with dozens of cascaded special functions makes everything uniform, but it's uniformly much worse. I feel like you're assuming that by parallelizing your "EML tree" in dedicated hardware that problem goes away; but area isn't free in either dollars or power, so it doesn't.

➕ show 1 reply

alt Hacker News

Replies