Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets

107 points • by mpweiher • yesterday at 9:14 AM • 14 comments • view on HN

Comments

ridiculous_fish • today at 5:52 AM

This paper missed the state of the art!!!

This concerns unsigned division by a 32 bit constant divisor. Compilers routinely optimize this to a high-multiply by a "magic number" but this number may be up to 33 bits (worst-case, divisor dependent, 7 is a problem child). So you may need a 32x33-bit high multiply.

What compilers do today is some bit tricks to perform a 32x33 bit high multiply through a 32x32 multiply and then handling the last bit specially, through additions and bitshifts. That's the "GM Method" in the paper; the juice to be squeezed out is the extra stuff to handle the 33rd bit.

What the paper observes is that 32x33 high multiply can be performed via a 64x64 high multiply and then the extra stuff goes away. Well yes, of course.

But amazingly in ALL of these worst case situations you can compute a different magic number, and perform a (saturating) increment of the dividend at runtime, and ONLY need a 32 bit high multiply.

That is, these are the two algorithms for unsigned division by constants where the magic number overflows:

- This paper: Promote to twice the bit width, followed by high multiply (32x64 => 64) and then bitshift

- SOTA: Saturating increment of the dividend, then high multiply (32x32 => 32) and then bitshift

Probably the paper's approach is a little better on wide multipliers, but they missed this well-known technique published 15 years ago (and before that, I merely rediscovered it):

https://ridiculousfish.com/blog/posts/labor-of-division-epis...

➕ show 3 replies

foltik • today at 4:42 AM

Compilers already optimize division of a 32 bit integer x by a constant d like:

  c = 2^a / d
  so: d = 2^a / c
  so: x / d = (x * c) / 2^a

And /2^a is equivalent to >>a, so we have:

  x / d = (x * c) >> a

Which is just a multiply and shift, cheaper than integer division.

Problem is, with some divisors like 7, 14, 21, ... the smallest c that works to produce an exact result is 33 bits. Awkward on 32 bit CPUs, it just barely doesn’t fit but you can account for that extra 1 bit manually with an extra sequence of sub, shift, add.

What the paper points out is twofold:

1. On 64 bit CPUs the extra sequence isn’t required, you can just do a full 64x64 bit multiply and 128 bit shift. Back to just a multiply and shift like the original optimization, and already faster than the workaround.

2. Even better, the shift itself can be optimized away. A 64x64 bit multiply stores the 128 bit product in a pair of 64 bit registers, meaning you basically get >> 64 for free by just looking at the upper half register in the pair. That means if you pre-shift the constant to:

  c’ = c << (64 - a)

Now (x * c’) >> 64 is equivalent to (x * c) >> a. And the result is just waiting for you in a 64 bit register, no shift needed.

Just one multiply to divide a 32 bit integer by 7. Nice.

purplesyringa • today at 6:27 AM

I must admit I'm surprised to see this -- Lemire offhandedly mentioned in the famous remainder blog post (https://lemire.me/blog/2019/02/08/faster-remainders-when-the...) that 64-bit constants can be used for 32-bit division, and even provided a short example to compute the remainder that way (though not the quotient). Looking a bit more, it seems like libdivide didn't integrate this optimization either.

I guess everyone just assumed that this is so well-known now, that compilers have certainly integrated it, but no one actually bothered to submit a patch until now, when it was reinvented?

gopalv • today at 5:26 AM

The linked clang PR is also very readable.

https://github.com/llvm/llvm-project/pull/181288/files

As the PR clearly points out, you can do this in a register but not inside vectors.

I don't think fastdiv has had an update in years, which what I've used because compilers can't do "this is a constant for the next loop of 1024" like columnar sql needs.

dmcq2 • today at 2:20 PM

At this rate it might be worthwhile turning integer divisions by an unknown which is constant in a loop into a call to a routine to generate the inverse and then use that within the loop. Though I guess that would be pretty infrequent compared to the floating point equivalent.

jcalvinowens • today at 3:30 PM

> This LLVM change has been merged into llvm:main. We also prepared a corresponding GCC patch [5], which is currently under review.

Best thing in the paper, bravo :)

pkaye • today at 6:29 AM

Look up the book Hackers Delight if you like there low level algorithms and bit manipulations

trendbuilder • today at 4:23 AM

[flagged]

alt Hacker News

Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets

Comments