I'd have guessed multiply-by-0 and multiply-by-1 can be special-cased to take much faster, simpler code paths, like you'd do when writing MUL for a processor that doesn't have it (I <3 Z80)
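For anyone who hasn't written a software MUL: here is a rough sketch of the idea in Python, not real Z80 code. It is a plain shift-and-add multiply for unsigned 8-bit operands with early exits for 0 and 1. The name `mul8` and the returned iteration count are made up for illustration, and nothing here is a claim about how GPU multipliers work.

```python
def mul8(a: int, b: int) -> tuple[int, int]:
    """Multiply two unsigned 8-bit values with shift-and-add.

    Returns (16-bit product, number of loop iterations used).
    Hypothetical helper for illustration only.
    """
    a &= 0xFF
    b &= 0xFF

    # Fast paths: no loop at all for the trivial operands.
    if a == 0 or b == 0:
        return 0, 0
    if a == 1:
        return b, 0
    if b == 1:
        return a, 0

    product = 0
    steps = 0
    # Classic shift-and-add: for each set bit in b, add the shifted a.
    while b:
        if b & 1:
            product += a
        a <<= 1
        b >>= 1
        steps += 1
    return product & 0xFFFF, steps


if __name__ == "__main__":
    for x, y in [(0, 200), (1, 77), (13, 11), (255, 255)]:
        print(x, y, mul8(x, y))
```

The loop also stops as soon as the remaining bits of `b` are zero, so small multipliers are cheaper than large ones even without the explicit special cases.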
I went in expecting to find 'branch prediction'[0] as the answer, but apparently things are even more complex nowadays.
[0] - https://stackoverflow.com/questions/11227809/why-is-conditio...
I can't tell from the blog: is this actually verified, or is it a theory plus numbers showing plausibility?
I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.
I feel like many of the comments missed the point or didn't read the article. What I believe this article is stating (and I've read this many times during my PhD for various reasons) is that the input data distribution affects how many transistor state changes there are during multiplication. Since these events account for a large portion of energy loss/heat generation, the clocks won't be throttled as much for certain data patterns.
I believe there was a workshop paper from SC24 that did more experiments around this. I can't find it now though.
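To make the "state changes" idea concrete, here is a toy sketch in Python. It assumes a single 16-bit register that receives a stream of operand values and counts how many bits flip between consecutive values (the Hamming distance). This is only a crude proxy: real switching activity depends on the multiplier's internal structure, and `toggle_count`, the 16-bit width, and the example streams are all my own assumptions, not anything from the article.

```python
import random


def toggle_count(values, width=16):
    """Count bit flips in a `width`-bit register as it takes each value in turn.

    The register is assumed to start at 0. Hypothetical helper, toy model only.
    """
    mask = (1 << width) - 1
    flips = 0
    prev = 0
    for v in values:
        v &= mask
        # XOR marks the bits that changed; count the set bits.
        flips += bin(prev ^ v).count("1")
        prev = v
    return flips


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    n = 10_000
    streams = {
        "all zeros": [0] * n,
        "constant": [0x3C00] * n,  # the same 16-bit pattern repeated
        "uniform random": [rng.getrandbits(16) for _ in range(n)],
    }
    for name, values in streams.items():
        print(f"{name:>15}: {toggle_count(values)} bit flips")
```

In this toy model, all-zero and constant streams flip almost no bits after the first value, while uniformly random bits flip about half the register on every step. That is the direction of the effect described above, not a measurement of it.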
People have been noticing the effects of this in local LLM inference. Power limiting seems to improve overall performance!
It wouldn't surprise me to see some ML algorithm in silico somewhere to select faster matmul paths on favorable data. Yo dawg, I heard you like AI, so we put some AI in your AI so you can infer while you're inferring.
Designing for predictable execution flow is one of the advantages of Tenstorrent hardware.
https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrren...
https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-t...
> For example, when the GPU is fully idle, nvidia-smi tells me that it’s only pulling 88W of power.
I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?