Hacker News

Faster asin() was hiding in plain sight

152 points by def-pri-pub today at 2:35 PM | 88 comments

Comments

jason_s today at 3:45 PM

While I'm glad to see the OP got a good minimax solution at the end, it seems like the article missed clarifying one of the key points: error waveforms over a specified interval are critical, and if you don't see the characteristic minimax-like wiggle, you're wasting an easy opportunity for improvement.

Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.

I prefer Chebyshev approximation (https://www.embeddedrelated.com/showarticle/152.php), which is often close enough to the more complicated Remez algorithm.

xt00 today at 4:43 PM

To be accurate, this is originally from Hastings (1955, Princeton), "Approximations for Digital Computers" by Cecil Hastings, pages 159-163, which actually gives multiple versions of the approximation with different constants. So the original work was done with the goal of being performant on the computers of the 1950s. Then the famous Abramowitz and Stegun guys put it in as formula 4.4.45 with permission, and then the NVIDIA Cg library wrote some code based on that formula, likely with some optimizations.

LegionMammal978 today at 3:28 PM

In general, I find that minimax approximation is an underappreciated tool, especially the quite simple Remez algorithm to generate an optimal polynomial approximation [0]. With some modifications, you can adapt it to optimize for either absolute or relative error within an interval, or even come up with rational-function approximations. (Though unfortunately, many presentations of the algorithm use overly-simple forms of sample point selection that can break down on nontrivial input curves, especially if they contain small oscillations.)

[0] https://en.wikipedia.org/wiki/Remez_algorithm

cmovq today at 5:44 PM

> After all of the above work and that talk in mind, I decided to ask an LLM.

Impressive that an LLM managed to produce the answer from a 7-year-old Stack Overflow answer all on its own! [1] This would have been the first search result for “fast asin” before this article was published.

[1]: https://stackoverflow.com/a/26030435

exmadscientist today at 4:06 PM

This line:

> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.

was kind of telling for me. I have some background in this sort of work (and long ago concluded that there was pretty much nothing you can do to improve on existing code, unless either you have some new specific hardware or domain constraint, or you're just looking for something quick-n-dirty for whatever reason, or are willing to invest research-paper levels of time and effort) and to think that someone would call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming or stuff like that. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."

AlotOfReading today at 3:24 PM

I'm pretty sure it's not faster, but it was fun to write:

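    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    // Renamed from asin() so it does not clash with <math.h>.
    float asin_bithack(float x) {
      float x2 = 1.0f - fabsf(x);
      uint32_t i; float inv;
      memcpy(&i, &x2, sizeof i);       // bit-cast float -> u32
      i = 0x5f3759df - (i >> 1);       // fast inverse sqrt initial guess
      memcpy(&inv, &i, sizeof inv);    // bit-cast u32 -> float
      const float half_pi = 1.57079632679f;
      return copysignf(half_pi - half_pi * (x2 * inv), x);  // x2*inv ~= sqrt(1-|x|)
    }

Courtesy of evil floating point bithacks.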
scottlamb today at 3:16 PM

Isn't the faster approach SIMD [edit: or GPU]? A 1.05x to 1.90x speedup is great. A 16x speedup is better!

They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.

I searched for asin in Intel's intrinsics guide. They have an AVX-512 intrinsic `_mm512_asin_ps`, but it is listed as a "sequence" rather than a single instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.

https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

orangepanda today at 2:57 PM

> Nobody likes throwing away work they've done

I like throwing away work I've done. Frees up my mental capacity for other work to throw away.

sixo today at 4:44 PM

It appears that the real lesson here was to lean quite a bit more on theory than a programmer's usual roll-your-own heuristic would suggest.

A fantastic amount of collective human thought has been dedicated to function approximations in the last century; Taylor methods are over 200 years old and unlikely to come close to state-of-the-art.

glitchc today at 3:45 PM

The 4% improvement doesn't seem like it's worth the effort.

On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So, replacing one with the other will not confer much benefit, as evidenced from the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.

empiricus today at 4:24 PM

Does anyone know of resources on the algorithms used in hardware implementations of math functions? I mean the algorithms inside CPUs and GPUs: how they trade off transistor count, power consumption, and cycles, and which algorithms allow this.

erichocean today at 2:51 PM

Ideal HN content, thanks!

ok123456 today at 4:22 PM

The Chebyshev series for asin is the sum over odd n of 4*T_n(x) / (pi*n*n); the even terms are 0.
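
A quick way to sanity-check the series is to sum it numerically. The sketch below takes the coefficients as 4/(pi*n^2) for odd n, which is what makes the sum reach pi/2 at x = 1 (since 1 + 1/9 + 1/25 + ... = pi^2/8). The function name and the `max_n` parameter are made up for this example.

```c
#include <math.h>

/* Partial sum of asin(x) = (4/pi) * sum over odd n of T_n(x) / n^2,
 * with T_n generated by T_{n+1}(x) = 2x*T_n(x) - T_{n-1}(x).
 * max_n is the highest index n included in the sum. */
double asin_cheb_series(double x, int max_n) {
    const double pi = 3.14159265358979323846;
    double t_prev = 1.0;  /* T_0(x) */
    double t_cur = x;     /* T_1(x) */
    double sum = 0.0;
    for (int n = 1; n <= max_n; n++) {
        if (n % 2 == 1)
            sum += t_cur / ((double)n * n);   /* odd terms only */
        double t_next = 2.0 * x * t_cur - t_prev;
        t_prev = t_cur;
        t_cur = t_next;
    }
    return 4.0 / pi * sum;
}
```

Because the coefficients only decay like 1/n^2, the truncated series converges slowly near x = ±1, where asin has square-root singularities. That is presumably why practical approximations factor out sqrt(1 - x) first instead of expanding asin directly.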

drsopp today at 3:10 PM

Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache depending on the CPU model.

stephc_int13 today at 3:43 PM

My favorite tool for experimenting with math approximations is lolremez. And you can easily ask your LLM to do it for you.

varispeed today at 7:06 PM

If you are interested in such "tricks", you should check out the classic Hacker's Delight by Henry Warren.

adampunk today at 3:07 PM

We love to leave faster functions languishing in library code. The basis for Q3A’s fast inverse square root had been sitting in fdlibm since 1986, on the net since 1993: https://www.netlib.org/fdlibm/e_sqrt.c

patchnull today at 5:23 PM

[flagged]

patchnull today at 3:07 PM

[flagged]
