Nitpicking the C variant:
> #define BLQS_CMP(a, b) ((a) < (b))
A function that returns true when one operand is Less Than the other, should be called BLQS_LT. The CMP abbreviation is idiomatic for a function that returns -1,0, or 1.
for (int i = 0; i < 1000; i++) {
small_numbers[smlen] = numbers[i];
smlen += (numbers[i] < 500);
}
is much faster than the conventional version with a conditional branch: for (int i = 0; i < 1000; i++) {
if (numbers[i] < 500) {
small_numbers[smlen] = numbers[i];
smlen += 1;
}
}
Been staring at this for a bit, but my brain is not working properly today: could someone please explain how these to loops compute the same value for small_numbers[smlen]?> On modern CPUs, avoiding branch misprediction is a key technique to speed up programs.
This is true but it's misleading. The reality is that modern out-of-order superscalar CPUs are so good at branch prediction that it's nearly always better to branch in a tight loop (to allow more ILP) than introduce a data-dependency in a tight loop (which limits ILP). Cf. https://mazzo.li/posts/value-speculation.html, https://yarchive.net/comp/linux/cmov.html
Branchless code should generally be avoided because modern CPUs are not designed to optimize that use case. There are exceptions of course, but those are exceptions.
It's unfortunate that the C++ version of the code assumes the type T is default-constructible (and that the default constructor is cheap). It also assumes that the type T is copy-constructible; at a glance I can't tell if the algorithm depends on making a copy in every place that it does make a copy. E.g. in the `heap_sort` helper we have
T k; // default-construct
if (i > 0) k = left[--i]; // copy-assign
This fairly obviously could be replaced with "copy-construct." Could it be replaced with "move-construct"? I don't know.
Again, in `partition_small`, we have T swbuf[SMALLPART];
which default-constructs a bunch of Ts. I think we're just going to overwrite that memory in a moment anyway, so constructing all those Ts is a waste of cycles; but I'm not sure.All of my "I don't knows" and "I'm not sures" are due to my own lack of digging into the code; I'm sure one could find out if one really looked.
None of that matters if you're just sorting `int` or the benchmarked `struct entry`. But it matters a great deal if you're taking the README literally and trying to sort "types with higher copy costs [...] (such as strings)".
Aren't there several bitonic sort network implementations that are vectorized, Intel's in particular?
Why not compare against that?
I‘m always a bit envious when I see those branchless styles. In my day job I have the obligation to hit 100% modified condition/decision coverage, and I‘m daydreaming about having just one control flow through everything, in order to save module tests that only test the umpteenth condition combination.
Obviously, readable code wins, but at least once I had the computing time budget to be able to have a central function go straight through by calculating all five or so variations (it was about several kinds of encodings of the output values) and just pick the correct one in the end. That felt good.
It is so simple that I had to look very slowly to understand. Nicely done.
>On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:
>
>for (int i = 0; i < 1000; i++) {
> small_numbers[smlen] = numbers[i];
> smlen += (numbers[i] < 500);
>}
Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0? I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?
On what datatype though, e.g. for sorting arbitrary length strings? I think that is if the comparator is expensive, quicksort and variants do not win because they do a constant factor more comparisons
If it's branch predicting, why would if statement run slow? How come unnecessary memory write is fine?
Anyone interested in branch-free code might like the book Hacker's Delight. Lots of examples of stuff like this in there.
Off topic. But how are you supposed to explore that website? There seems to be no way to look at other blog posts, https://tiki.li/blog/ errors out. https://tiki.li doesn't seem to have a link to the blog.
Since pdqsort (an older project of mine) was mentioned, I felt it wouldn't be entirely inappropriate to mention that I've since then collaborated with Lukas Bergdoll to provide two high-quality sort implementations for the Rust standard library, ipnsort (unstable) and driftsort (stable).
So if you use Rust, you get these by simply calling [T]::sort(_unstable). Great performance out of the box :)
On my machine (Apple M2), using the benchmarks from the repository on Apple clang 17 and Rust 1.98 nightly:
And now for a cool party trick, let's repeat the 50 million doubles experiment again, but have the first 90% already sorted, last 10% random: