The algorithm here, fast doubling, is sequential (there is no parallelism) and is quite fast, needing fewer than a hundred operations to arrive at the final result. This runs in nanoseconds on a CPU, so the benchmark in the article is mostly measuring communication time rather than actual computation time.
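For reference, the doubling step rests on these standard identities (this is where the logarithmic operation count comes from):

    F(2k)   = F(k) · (2·F(k+1) − F(k))
    F(2k+1) = F(k)² + F(k+1)²

Each doubling step costs three multiplications, and 99999999 < 2²⁷, so reaching F(99999999) takes on the order of 27 × 3 ≈ 80 multiplications, consistent with the count above.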
They are unnecessarily filling an array of 99999999 elements, only to print the last element and throw away the rest. Most of the time is going to be spent filling this 400MB array.
Unintentionally, the article is a good cautionary tale of how algorithmic knowledge can beat raw hardware power, if you know what you're doing.
Writing C code that performs this fast doubling algorithm for Fibonacci numbers was actually quite fun. Highly recommend it.
In my experiments, where I didn't reduce the numbers modulo anything, the compute time was dominated by the last couple of multiplications of megabyte-sized integers.
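For anyone tempted to try it: here is a minimal sketch of fast doubling in C, with the modular reduction that keeps every value word-sized (the modulus, the __uint128_t multiply, and all names are my choices, not the parent's code):

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only. MOD is an arbitrary prime; reducing mod MOD
       keeps every intermediate value inside 64 bits. */
    #define MOD 1000000007ULL

    static uint64_t mulmod(uint64_t a, uint64_t b) {
        return (uint64_t)((__uint128_t)a * b % MOD);
    }

    /* F(n) mod MOD via the doubling identities:
       F(2k)   = F(k) * (2*F(k+1) - F(k))
       F(2k+1) = F(k)^2 + F(k+1)^2 */
    static uint64_t fib(uint64_t n) {
        uint64_t a = 0, b = 1;                 /* (F(k), F(k+1)), k = 0 */
        for (int i = 63; i >= 0; i--) {
            uint64_t t = mulmod(a, (2 * b % MOD + MOD - a) % MOD); /* F(2k) */
            uint64_t u = (mulmod(a, a) + mulmod(b, b)) % MOD;      /* F(2k+1) */
            a = t; b = u;                      /* k <- 2k */
            if ((n >> i) & 1) {                /* k <- 2k + 1 */
                t = (a + b) % MOD;
                a = b; b = t;
            }
        }
        return a;
    }

    int main(void) {
        /* the article's target index */
        printf("%llu\n", (unsigned long long)fib(99999999));
        return 0;
    }

The bit loop runs a fixed 64 iterations for simplicity; starting from the highest set bit of n is an easy optimization. Drop the reduction and you need a bignum type, at which point the last few multiplications dominate, as described above.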
They also do not make use of the fact that matrix multiplication is associative, so M⁴ = M² × M², M⁸ = M⁴ × M⁴, etc. That means one can compute M³², and from it F32, using 5 matrix multiplications; the article uses 31.
It wouldn’t surprise me at all if doing those 5 matrix multiplications on a CPU were faster than doing 31 on the GPU.
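For concreteness, repeated squaring for the 2×2 Fibonacci matrix might look like this (a hedged sketch; MOD, mulmod, and the struct layout are my choices, and M = [[1,1],[1,0]] is the standard Fibonacci matrix):

    #include <stdint.h>

    /* arbitrary prime, standing in for the article's modulus */
    #define MOD 1000000007ULL

    static uint64_t mulmod(uint64_t a, uint64_t b) {
        return (uint64_t)((__uint128_t)a * b % MOD);
    }

    /* M = [[1,1],[1,0]]; M^n = [[F(n+1),F(n)],[F(n),F(n-1)]] */
    typedef struct { uint64_t m[2][2]; } mat2;

    static mat2 mul(mat2 x, mat2 y) {
        mat2 r;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                r.m[i][j] = (mulmod(x.m[i][0], y.m[0][j]) +
                             mulmod(x.m[i][1], y.m[1][j])) % MOD;
        return r;
    }

    /* Exponentiation by squaring: ~log2(n) multiplies instead of n - 1. */
    static mat2 mat_pow(mat2 m, uint64_t n) {
        mat2 r = {{{1, 0}, {0, 1}}};           /* identity */
        while (n) {
            if (n & 1) r = mul(r, m);
            m = mul(m, m);
            n >>= 1;
        }
        return r;
    }

For M³² this performs a handful of squarings rather than 31 sequential multiplies.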
Also, FTA: “To calculate extremly large fibonacci numbers we need to avoid integer overflow. A simple way to avoid that is to just take everything modulo a given number in the matrix multiplication”
⇒ This is also NOT calculating the Fibonacci numbers; it cops out exactly where things get difficult on a GPU.
If you want to compute Fn mod some constant for 1 ≤ n ≤ some large bound fast on a GPU, I think I would compute a spread of checkpoint matrices (the powers M^(k·B) for some block size B) using the fast way, then spin up ‘threads’ that each fill in the intermediate values by repeatedly multiplying their checkpoint with the ‘standard’ matrix.
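Concretely, that two-phase scheme might look like this on a CPU, where an ordinary loop stands in for the GPU threads (N, B, the modulus, and every helper name are my own assumptions, not from the article or the parent comment; mulmod, mat2, mul, and mat_pow are repeated from the sketch above so this compiles on its own):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MOD 1000000007ULL   /* arbitrary prime modulus */
    #define N   1048576u        /* how many indices to cover */
    #define B   1024u           /* block size: one 'thread' per block */

    static uint64_t mulmod(uint64_t a, uint64_t b) {
        return (uint64_t)((__uint128_t)a * b % MOD);
    }

    typedef struct { uint64_t m[2][2]; } mat2;

    static mat2 mul(mat2 x, mat2 y) {
        mat2 r;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                r.m[i][j] = (mulmod(x.m[i][0], y.m[0][j]) +
                             mulmod(x.m[i][1], y.m[1][j])) % MOD;
        return r;
    }

    static mat2 mat_pow(mat2 m, uint64_t n) {
        mat2 r = {{{1, 0}, {0, 1}}};           /* identity */
        while (n) {
            if (n & 1) r = mul(r, m);
            m = mul(m, m);
            n >>= 1;
        }
        return r;
    }

    int main(void) {
        uint64_t *F = malloc(((uint64_t)N + 1) * sizeof *F);
        mat2 *cp = malloc((N / B + 1) * sizeof *cp);
        if (!F || !cp) return 1;

        const mat2 base = {{{1, 1}, {1, 0}}};  /* the 'standard' matrix M */
        mat2 step = mat_pow(base, B);          /* M^B via the fast way */

        /* Phase 1: a short sequential chain of checkpoints, cp[k] = M^(k*B). */
        cp[0] = (mat2){{{1, 0}, {0, 1}}};      /* identity = M^0 */
        for (unsigned k = 1; k <= N / B; k++)
            cp[k] = mul(cp[k - 1], step);

        /* Phase 2: each block is independent; on a GPU, this outer loop body
           is what one thread would run. F(i) sits at m[0][1] of M^i. */
        for (unsigned k = 0; k < N / B; k++) {
            mat2 p = cp[k];
            for (unsigned i = 0; i < B; i++) {
                F[(uint64_t)k * B + i] = p.m[0][1];
                p = mul(p, base);              /* advance the index by one */
            }
        }
        F[N] = cp[N / B].m[0][1];

        printf("F(%u) mod %llu = %llu\n", (unsigned)N,
               (unsigned long long)MOD, (unsigned long long)F[N]);
        free(cp);
        free(F);
        return 0;
    }

Phase 1 is cheap and sequential; phase 2 is embarrassingly parallel, which seems like a better fit for what a GPU is actually good at than the article's single chain of 2×2 multiplications.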