> This is not the first time we can see Nvidia taking shortcuts to achieve maximum performance of their GPUs
Why is implementing it correctly not performant? For context I have no idea how rounding is typically implemented anyways.
Denormals happen to be the way that Zero can even be represented at all?
Another thing to keep in mind is that CPU processing of denormals tends to be extremely slow - I vaguely recall running into something like a 10x slowdown a decade ago.
For a lot of applications the difference between a denormal and zero is small enough to be irrelevant, so if you expect near-zero values to be common, enabling a denormals-to-zero compiler flag might give you a pretty nice performance boost for free.
Flush denormals to zero. Even their inventor had trouble writing correct code in their presence - see the Appendix to that "what every programmer should know..." paper
It's one of several issues with the design of IEEE floats, unfortunately. I wish we could start thinking more seriously about a new design, to complement if not replace IEEE in the long term. Posits are an example https://github.com/andrepd/posit-rust
WebGPU (WGSL) handles this by having a specified accuracy for each operation.
https://www.w3.org/TR/WGSL/#concrete-float-accuracy
This is all fully tested in the CTS.
https://gpuweb.github.io/cts/standalone/?q=webgpu:shader,*