Compilers already optimize division of a 32 bit integer x by a constant d like:
c = 2^a / d
so: d = 2^a / c
so: x / d = (x * c) / 2^a
And /2^a is equivalent to >>a, so we have: x / d = (x * c) >> a
Which is just a multiply and shift, cheaper than integer division.Problem is, with some divisors like 7, 14, 21, ... the smallest c that works to produce an exact result is 33 bits. Awkward on 32 bit CPUs, it just barely doesn’t fit but you can account for that extra 1 bit manually with an extra sequence of sub, shift, add.
What the paper points out is twofold:
1. On 64 bit CPUs the extra sequence isn’t required, you can just do a full 64x64 bit multiply and 128 bit shift. Back to just a multiply and shift like the original optimization, and already faster than the workaround.
2. Even better, the shift itself can be optimized away. A 64x64 bit multiply stores the 128 bit product in a pair of 64 bit registers, meaning you basically get >> 64 for free by just looking at the upper half register in the pair. That means if you pre-shift the constant to:
c’ = c << (64 - a)
Now (x * c’) >> 64 is equivalent to (x * c) >> a. And the result is just waiting for you in a 64 bit register, no shift needed.Just one multiply to divide a 32 bit integer by 7. Nice.
I must admit I'm surprised to see this -- Lemire offhandedly mentioned in the famous remainder blog post (https://lemire.me/blog/2019/02/08/faster-remainders-when-the...) that 64-bit constants can be used for 32-bit division, and even provided a short example to compute the remainder that way (though not the quotient). Looking a bit more, it seems like libdivide didn't integrate this optimization either.
I guess everyone just assumed that this is so well-known now, that compilers have certainly integrated it, but no one actually bothered to submit a patch until now, when it was reinvented?
The linked clang PR is also very readable.
https://github.com/llvm/llvm-project/pull/181288/files
As the PR clearly points out, you can do this in a register but not inside vectors.
I don't think fastdiv has had an update in years, which what I've used because compilers can't do "this is a constant for the next loop of 1024" like columnar sql needs.
At this rate it might be worthwhile turning integer divisions by an unknown which is constant in a loop into a call to a routine to generate the inverse and then use that within the loop. Though I guess that would be pretty infrequent compared to the floating point equivalent.
> This LLVM change has been merged into llvm:main. We also prepared a corresponding GCC patch [5], which is currently under review.
Best thing in the paper, bravo :)
Look up the book Hackers Delight if you like there low level algorithms and bit manipulations
[flagged]
This paper missed the state of the art!!!
This concerns unsigned division by a 32 bit constant divisor. Compilers routinely optimize this to a high-multiply by a "magic number" but this number may be up to 33 bits (worst-case, divisor dependent, 7 is a problem child). So you may need a 32x33-bit high multiply.
What compilers do today is some bit tricks to perform a 32x33 bit high multiply through a 32x32 multiply and then handling the last bit specially, through additions and bitshifts. That's the "GM Method" in the paper; the juice to be squeezed out is the extra stuff to handle the 33rd bit.
What the paper observes is that 32x33 high multiply can be performed via a 64x64 high multiply and then the extra stuff goes away. Well yes, of course.
But amazingly in ALL of these worst case situations you can compute a different magic number, and perform a (saturating) increment of the dividend at runtime, and ONLY need a 32 bit high multiply.
That is, these are the two algorithms for unsigned division by constants where the magic number overflows:
- This paper: Promote to twice the bit width, followed by high multiply (32x64 => 64) and then bitshift
- SOTA: Saturating increment of the dividend, then high multiply (32x32 => 32) and then bitshift
Probably the paper's approach is a little better on wide multipliers, but they missed this well-known technique published 15 years ago (and before that, I merely rediscovered it):
https://ridiculousfish.com/blog/posts/labor-of-division-epis...