Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data (2024)

114 points • by tosh • last Saturday at 12:11 PM • 32 comments

Comments

> For example, when the GPU is fully idle, nvidia-smi tells me that it’s only pulling 88W of power.

I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?

ggambetta • today at 5:04 PM

I'd have guessed multiply-by-0 and multiply-by-1 can be special-cased to run much faster and simpler code paths, like you'd do when writing MUL for a processor that doesn't have it (I <3 z80)
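To make the software-multiply idea concrete, here is a minimal sketch of a shift-and-add routine of the kind one might write for a CPU without a MUL instruction, with the 0 and 1 cases short-circuited. The function name `shift_add_mul` and the step counter are hypothetical, added only for illustration. This models a software routine, not how a GPU's multiply hardware works.

```python
def shift_add_mul(a: int, b: int) -> tuple[int, int]:
    """Multiply two non-negative integers with shift-and-add.

    Returns (product, loop_iterations) so the cost of each path is visible.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")

    # Special cases: no loop needed at all.
    if a == 0 or b == 0:
        return 0, 0
    if a == 1:
        return b, 0
    if b == 1:
        return a, 0

    product = 0
    steps = 0
    while b:
        if b & 1:          # lowest bit of the multiplier is set
            product += a   # add the shifted multiplicand
        a <<= 1            # next bit is worth twice as much
        b >>= 1            # consume one multiplier bit
        steps += 1
    return product, steps


print(shift_add_mul(13, 11))  # general path: one iteration per multiplier bit
print(shift_add_mul(0, 99))   # special-cased, zero iterations
print(shift_add_mul(1, 99))   # special-cased, zero iterations
```

In this sketch the general path costs one loop iteration per bit of the multiplier, while the special cases return immediately.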

nzach • today at 2:27 PM

I went in expecting to find 'branch prediction'[0] as the answer, but apparently things are even more complex nowadays.

[0] - https://stackoverflow.com/questions/11227809/why-is-conditio...

jayd16 • today at 2:56 PM

I can't tell from the blog: is this actually verified, or is it theory plus numbers showing plausibility?

I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.

amelius • today at 3:17 PM

Sounds like a side channel attack waiting to happen.

jetsamflotsam • today at 4:20 PM

I feel like many of the comments missed the point or didn't read the article. What I believe the article is stating (and I've read this many times during my PhD, for various reasons) is that the input data distribution affects how many transistor state changes occur during multiplication. Since these switching events account for a large portion of energy loss and heat generation, the clocks won't be throttled as much for certain data patterns.

I believe there was a workshop paper from SC24 that ran more experiments around this, but I can't find it now.
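A rough way to see the "state changes" idea is to count how many bits differ between consecutive float32 values in a stream. The sketch below is only a toy proxy under stated assumptions: the helper names `float32_bits` and `mean_bit_flips` are hypothetical, and real switching activity depends on the GPU's actual datapath, which this does not model.

```python
import random
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def mean_bit_flips(values) -> float:
    """Average number of bits that differ between consecutive float32 values."""
    bits = [float32_bits(v) for v in values]
    flips = [bin(a ^ b).count("1") for a, b in zip(bits, bits[1:])]
    return sum(flips) / len(flips)


rng = random.Random(0)  # fixed seed so runs are repeatable
n = 10_000

random_data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # "unpredictable" inputs
constant_data = [1.5] * n                              # every element identical
zero_data = [0.0] * n                                  # all zeros

print("random  :", mean_bit_flips(random_data))
print("constant:", mean_bit_flips(constant_data))  # 0.0, nothing ever toggles
print("zeros   :", mean_bit_flips(zero_data))      # 0.0, nothing ever toggles
```

Under this proxy, constant or all-zero inputs toggle no bits between consecutive operands, while random inputs toggle many, which is the direction of the effect described above.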

gdevenyi • today at 12:53 PM

People have been noticing the effects of this in local LLM inference. Power limiting seems to improve overall performance!

bitwize • today at 3:20 PM

It wouldn't surprise me to see some ML algorithm in silico somewhere to select faster matmul paths on favorable data. Yo dawg, I heard you like AI, so we put some AI in your AI so you can infer while you're inferring.

evanjrowley • today at 2:46 PM

Designing for predictable execution flow is one of the advantages of Tenstorrent hardware.

https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrren...

https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-t...

https://arxiv.org/html/2604.03279
