Branchless Quicksort faster than std:sort and pdqsort with C and C++ API

208 points • by birdculture • last Tuesday at 8:00 PM • 67 comments • view on HN

Comments

Since pdqsort (an older project of mine) was mentioned, I felt it wouldn't be entirely inappropriate to mention that I've since then collaborated with Lukas Bergdoll to provide two high-quality sort implementations for the Rust standard library, ipnsort (unstable) and driftsort (stable).

So if you use Rust, you get these by simply calling [T]::sort(_unstable). Great performance out of the box :)

On my machine (Apple M2), using the benchmarks from the repository on Apple clang 17 and Rust 1.98 nightly:

    Sorting 50 million doubles:
    ipnsort             0.79s
    blqs                0.90s
    driftsort           1.13s   (stable)
    std::sort           1.22s
    std::stable_sort    4.64s   (stable)

    Sorting 50 million (i32, i32) structs:
    ipnsort             0.82s
    blqs                0.89s
    driftsort           1.07s   (stable)
    std::sort           3.09s
    std::stable_sort    3.15s   (stable)

And now for a cool party trick, let's repeat the 50 million doubles experiment again, but have the first 90% already sorted, last 10% random:

    driftsort           0.29s   (stable)
    ipnsort             0.81s
    std::sort           1.15s
    std::stable_sort    1.63s   (stable)
    blqs                1.89s

teo_zero • today at 6:17 AM

Nitpicking the C variant:

> #define BLQS_CMP(a, b) ((a) < (b))

A function that returns true when one operand is Less Than the other, should be called BLQS_LT. The CMP abbreviation is idiomatic for a function that returns -1,0, or 1.

kleiba2 • today at 5:49 AM

    for (int i = 0; i < 1000; i++) {
        small_numbers[smlen] = numbers[i];
        smlen += (numbers[i] < 500);
    }

is much faster than the conventional version with a conditional branch:

    for (int i = 0; i < 1000; i++) {
        if (numbers[i] < 500) {
            small_numbers[smlen] = numbers[i];
            smlen += 1;
        }
    }

Been staring at this for a bit, but my brain is not working properly today: could someone please explain how these to loops compute the same value for small_numbers[smlen]?

➕ show 6 replies

bagxrvxpepzn • today at 6:40 AM

> On modern CPUs, avoiding branch misprediction is a key technique to speed up programs.

This is true but it's misleading. The reality is that modern out-of-order superscalar CPUs are so good at branch prediction that it's nearly always better to branch in a tight loop (to allow more ILP) than introduce a data-dependency in a tight loop (which limits ILP). Cf. https://mazzo.li/posts/value-speculation.html, https://yarchive.net/comp/linux/cmov.html

Branchless code should generally be avoided because modern CPUs are not designed to optimize that use case. There are exceptions of course, but those are exceptions.

➕ show 1 reply

quuxplusone • today at 1:38 AM

It's unfortunate that the C++ version of the code assumes the type T is default-constructible (and that the default constructor is cheap). It also assumes that the type T is copy-constructible; at a glance I can't tell if the algorithm depends on making a copy in every place that it does make a copy. E.g. in the `heap_sort` helper we have

    T k;                       // default-construct
    if (i > 0) k = left[--i];  // copy-assign

This fairly obviously could be replaced with "copy-construct." Could it be replaced with "move-construct"? I don't know. Again, in `partition_small`, we have

    T swbuf[SMALLPART];

which default-constructs a bunch of Ts. I think we're just going to overwrite that memory in a moment anyway, so constructing all those Ts is a waste of cycles; but I'm not sure.

All of my "I don't knows" and "I'm not sures" are due to my own lack of digging into the code; I'm sure one could find out if one really looked.

None of that matters if you're just sorting `int` or the benchmarked `struct entry`. But it matters a great deal if you're taking the README literally and trying to sort "types with higher copy costs [...] (such as strings)".

➕ show 1 reply

mgaunard • yesterday at 10:11 PM

Aren't there several bitonic sort network implementations that are vectorized, Intel's in particular?

Why not compare against that?

➕ show 3 replies

Tomte • today at 7:25 AM

I‘m always a bit envious when I see those branchless styles. In my day job I have the obligation to hit 100% modified condition/decision coverage, and I‘m daydreaming about having just one control flow through everything, in order to save module tests that only test the umpteenth condition combination.

Obviously, readable code wins, but at least once I had the computing time budget to be able to have a central function go straight through by calculating all five or so variations (it was about several kinds of encodings of the output values) and just pick the correct one in the end. That felt good.

davidkwast • yesterday at 10:07 PM

It is so simple that I had to look very slowly to understand. Nicely done.

➕ show 1 reply

kvuj • today at 1:17 AM

>On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:

>for (int i = 0; i < 1000; i++) {

> small_numbers[smlen] = numbers[i];

> smlen += (numbers[i] < 500);

Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0? I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?

➕ show 3 replies

Aardwolf • today at 9:01 AM

On what datatype though, e.g. for sorting arbitrary length strings? I think that is if the comparator is expensive, quicksort and variants do not win because they do a constant factor more comparisons

dekdrop • today at 2:14 AM

If it's branch predicting, why would if statement run slow? How come unnecessary memory write is fine?

➕ show 3 replies

globular-toast • today at 7:45 AM

Anyone interested in branch-free code might like the book Hacker's Delight. Lots of examples of stuff like this in there.

➕ show 1 reply

benj111 • today at 10:48 AM

Off topic. But how are you supposed to explore that website? There seems to be no way to look at other blog posts, https://tiki.li/blog/ errors out. https://tiki.li doesn't seem to have a link to the blog.

➕ show 2 replies

alt Hacker News

Branchless Quicksort faster than std:sort and pdqsort with C and C++ API

Comments