logoalt Hacker News

Sesse__today at 5:29 PM9 repliesview on HN

And similarly, entire generations of programmers were never taught Horner's scheme. You can see it in the article, where they write stuff like

  A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D
(10 muls, 3 muladds)

instead of the faster

  tmp = x * x;
  ((A * tmp + B) * tmp + C) * tmp + D
(1 mul, 3 muladds)

Replies

anematodetoday at 7:35 PM

Yep, good stuff. Another nice trick to extract more ILP is to split it into even/odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own). For example something like

    typedef double v2d __attribute__ ((vector_size (16)));

    v2d packed = { x, x };
    packed = fma(packed, As, Bs);
    packed = fma(packed, Cs, Ds);
    // ...
    return x * packed[0] + packed[1]
smth like that

Actually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.

woadwarrior01today at 5:50 PM

The common subexpression elimination (CSE) pass in compilers takes care of that.

show 1 reply
eskatoday at 6:01 PM

The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.

show 2 replies
def-pri-pubtoday at 6:30 PM

The reason for writing out all of the x multiplications like that is that I was hoping the compiler detect such a pattern perform an optimization for me. Mat Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.

show 1 reply
arkmmtoday at 5:44 PM

Didn't know this technique had a name, but I would think a modern compiler could make this optimization on its own, no?

show 1 reply
owlbitetoday at 6:32 PM

Not just for speed, Horner can also be essential for numerical stability.

3371today at 5:42 PM

Isn't that for... readability...?

zahlmantoday at 5:38 PM

Is this outside of what compilers can do nowadays? (Or do they refuse because it's floating-point?)

boothbytoday at 5:57 PM

Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).

But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.

show 3 replies