> On modern CPUs, avoiding branch misprediction is a key technique to speed up programs.
This is true but it's misleading. The reality is that modern out-of-order superscalar CPUs are so good at branch prediction that it's nearly always better to branch in a tight loop (to allow more ILP) than introduce a data-dependency in a tight loop (which limits ILP). Cf. https://mazzo.li/posts/value-speculation.html, https://yarchive.net/comp/linux/cmov.html
Branchless code should generally be avoided because modern CPUs are not designed to optimize that use case. There are exceptions of course, but those are exceptions.
I do not agree to call "exceptions" the cases where branchless code is preferable, because they can be quite frequent in certain application domains, like sorting and searching.
The difference between the cases when branches are worse and the cases when they are better, is whether the tested condition is random (i.e. unpredictable) or not.
Whenever you compare a random number with a threshold (or two random numbers between themselves) and use the result for conditional execution, that is an example where using branches is worse.
In most cases, when writing a program it is easy to estimate whether branches will be predictable or not, and in the latter case branchless methods should be used.
I was thinking that... When they say "modern CPUs", surely that includes any pipelined CPU? Maybe Pentium 4 era long pipelines in particular. But actual modern CPUs are much better at branch prediction.
But for something like sorting wouldn't the worst case be completely random data which would defeat any kind of branch prediction?
Branchful only wins via ILP when data becomes good predictable. But since Quicksort partitioning aims for a 50/50 split, it operates in the worst possible zone for a branch predictor. That's why branchless wins here, as proven by the benchmarks.