Hacker News

wood_spirit, today at 6:57 PM

Thanks for making and thanks for sharing :)

I’m not a parallels kind of user but I can appreciate your craft and know how rewarding these odysseys can be :)

What was the biggest “aha” moment, when you worked out how things interlock, or when you needed to make both change A and change B at the same time because either one on its own slowed things down? And what was the single most impactful design choice?

And if you’re objective, what could be done to other tools to make them competitive?


Replies

jkool702, today at 10:04 PM

So, in forkrun's development there have been a few "AHA!" moments. Most of them were accompanied by a full rewrite (the current forkrun is v3).

The 1st AHA, and the basis for the original forkrun, was that you could eliminate a HUGE amount of the overhead of parallelizing things in shell if you use persistent workers, distribute data to them, and have them run things for you in a loop. This is why the project is called "forkrun": it's short for "first you FORK, then you RUN".
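To make the "fork once, then run in a loop" idea concrete, here is a minimal sketch in Python rather than shell. It is not forkrun's code: the function name `run_with_persistent_workers`, the round-robin distribution, and the pipe layout are all assumptions made for illustration. The point is only that the fork cost is paid once per worker instead of once per item.

```python
import os
import threading


def run_with_persistent_workers(items, n_workers, func):
    """Fork n_workers once, then stream items to them; returns result strings.

    Hypothetical helper for illustration. Items must not contain newlines.
    """
    result_r, result_w = os.pipe()
    feed_fds, pids = [], []
    for _ in range(n_workers):
        feed_r, feed_w = os.pipe()
        pid = os.fork()
        if pid == 0:  # worker: loop over its own feed until EOF
            status = 1
            try:
                os.close(feed_w)
                os.close(result_r)
                for fd in feed_fds:  # drop inherited write ends of earlier feeds
                    os.close(fd)
                with os.fdopen(feed_r, "r") as feed:
                    for line in feed:
                        item = line.rstrip("\n")
                        out = (str(func(item)) + "\n").encode()
                        os.write(result_w, out)  # small pipe writes are atomic
                status = 0
            finally:
                os._exit(status)
        os.close(feed_r)
        feed_fds.append(feed_w)
        pids.append(pid)
    os.close(result_w)

    results = []

    def drain():
        # Read results concurrently so workers never block on a full pipe.
        with os.fdopen(result_r, "r") as res:
            results.extend(line.rstrip("\n") for line in res)

    reader = threading.Thread(target=drain)
    reader.start()
    for i, item in enumerate(items):  # round-robin distribution (an assumption)
        os.write(feed_fds[i % n_workers], (str(item) + "\n").encode())
    for fd in feed_fds:
        os.close(fd)  # EOF tells each worker to exit its loop
    for pid in pids:
        os.waitpid(pid, 0)
    reader.join()
    return results
```

Results come back in whatever order the workers finish, so a caller that cares about ordering would have to tag each item with its index.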

The 2nd AHA, which spawned forkrun v2, was that you could distribute work without a central coordinator thread (which inevitably becomes the bottleneck). forkrun v2 did this by having 1 process dump data into a tmpfile on a ramdisk, then all the workers read from this file using a shared file descriptor and a lightweight pipe-based lock: write a newline into a shared anonymous pipe, read from pipe to acquire lock, write newline back to pipe to release it. FIFO naturally queues up waiters. This version actually worked really well, but it was a "serial read, parallel execute" design. Furthermore, the time it took to acquire and release a lock meant the design topped out at ~7 million lines per second. Nothing would make it faster, since that was the locking overhead.
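The pipe-based lock described above can be sketched like this. This is Python with threads standing in for the worker processes, and an in-memory `io.BytesIO` standing in for the tmpfile on a ramdisk; the names `PipeLock` and `claim_lines` are hypothetical. Only the token protocol (one newline in the pipe, read to acquire, write to release) comes from the description.

```python
import io
import os
import threading


class PipeLock:
    """Token-in-a-pipe mutex: whoever reads the single newline holds the lock."""

    def __init__(self):
        self.r, self.w = os.pipe()
        os.write(self.w, b"\n")  # one token in the pipe means the lock is free

    def __enter__(self):
        os.read(self.r, 1)  # blocks until the token is available
        return self

    def __exit__(self, *exc):
        os.write(self.w, b"\n")  # put the token back to release the lock
        return False

    def close(self):
        os.close(self.r)
        os.close(self.w)


def claim_lines(data, n_workers=4, batch=2):
    """Workers take turns reading `batch` lines from one shared file object."""
    shared = io.BytesIO(data)  # stand-in for the shared file descriptor
    lock = PipeLock()
    claimed = []

    def worker():
        while True:
            with lock:  # serial read: only the lock holder touches `shared`
                lines = [shared.readline() for _ in range(batch)]
            lines = [ln for ln in lines if ln]  # readline returns b"" at EOF
            if not lines:
                return
            claimed.append(lines)  # "execute" step would happen here, unlocked

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    lock.close()
    return claimed
```

Every batch costs one acquire and one release, which is the per-claim overhead that capped the design's throughput.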

The 3rd AHA was that I could make a very fast (SIMD-accelerated) delimiter scanner, post the byte offsets where lines (or batches of lines) started in the global data file, and then workers could claim batches and read data in parallel, making the design fully "parallel read + parallel execute".
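A rough sketch of the "scan once, publish offsets, read in parallel" shape, again in Python and not taken from forkrun. Here `bytes.find` stands in for the SIMD scanner, a `threading.Lock` around a counter stands in for whatever claiming mechanism forkrun actually uses, and `batch_offsets` and `parallel_read` are hypothetical names. Each worker reads its own byte range with `os.pread`, so no lock is held during the read itself.

```python
import os
import tempfile
import threading


def batch_offsets(data, lines_per_batch):
    """Byte offsets where each batch of lines starts, plus the final end offset."""
    offsets, pos, lines = [0], 0, 0
    while True:
        nl = data.find(b"\n", pos)  # stand-in for a SIMD delimiter scan
        if nl == -1:
            break
        pos = nl + 1
        lines += 1
        if lines % lines_per_batch == 0 and pos < len(data):
            offsets.append(pos)
    offsets.append(len(data))
    return offsets


def parallel_read(data, n_workers=4, lines_per_batch=2):
    """Workers claim batch indices, then read their own byte range in parallel."""
    offsets = batch_offsets(data, lines_per_batch)
    n_batches = len(offsets) - 1
    results = [None] * n_batches
    next_batch = 0
    claim = threading.Lock()  # assumption: stand-in for an atomic fetch-and-add

    with tempfile.TemporaryFile() as f:  # stand-in for the global data file
        f.write(data)
        f.flush()
        fd = f.fileno()

        def worker():
            nonlocal next_batch
            while True:
                with claim:  # claiming is tiny: just bump an index
                    i = next_batch
                    next_batch += 1
                if i >= n_batches:
                    return
                start, end = offsets[i], offsets[i + 1]
                # Positional read: no shared file cursor, so no read lock.
                results[i] = os.pread(fd, end - start, start)

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results
```

Compared with the pipe-lock design, the only serialized step left is claiming an index; the data itself is never read under a lock.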

The 4th AHA was regarding NUMA. It was "instead of reactively re-shuffling data between nodes, just put it on the right node to begin with". Furthermore, determine the "right node" using real-time backpressure from the nodes, with a 3-chunk buffer to ensure the nodes are always fed with data. This one didn't need a rewrite, but it is why forkrun scales SO WELL with NUMA.
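As a toy illustration of backpressure-driven placement (a simulation only, not forkrun's mechanism): each node keeps at most 3 chunks buffered, and a new chunk only goes to a node that has room. The tick-based model, the `node_speeds` parameter, and the name `place_chunks` are assumptions made for this sketch. Faster nodes drain their buffers sooner and so naturally receive more data, with no after-the-fact rebalancing.

```python
BUFFER_CHUNKS = 3  # per-node buffer depth, as mentioned in the post


def place_chunks(n_chunks, node_speeds):
    """Simulate placement; node_speeds[i] is chunks node i consumes per tick.

    Returns how many chunks were placed on each node. Hypothetical model.
    """
    if any(speed < 1 for speed in node_speeds):
        raise ValueError("every node must consume at least 1 chunk per tick")
    pending = [0] * len(node_speeds)  # chunks buffered on each node
    placed = [0] * len(node_speeds)   # total chunks sent to each node
    remaining = n_chunks
    while remaining or any(pending):
        # Placement: a node's free buffer slots are its backpressure signal.
        for node in range(len(node_speeds)):
            while remaining and pending[node] < BUFFER_CHUNKS:
                pending[node] += 1
                placed[node] += 1
                remaining -= 1
        # Consumption: each node works through its buffer at its own speed.
        for node, speed in enumerate(node_speeds):
            pending[node] -= min(pending[node], speed)
    return placed
```

With two nodes where the second is twice as fast, the second node ends up with the larger share of the chunks, and no node ever sits on more than 3 unprocessed chunks.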