> OS threads are expensive: an operating system thread typically reserves a megabyte of stack space
Why is reserving a megabyte of stack space "expensive"?
> and takes roughly a millisecond to create
I'm not sure where this number is from, it seems off by a few orders of magnitude. On Linux, thread creation is closer to 10 microseconds.
Yeah, none of this makes sense to me. Allocating memory for stack space is not expensive (and the default isn't even 1MB??) because you're just creating a VMA and probably faulting in one or two pages.
They also say:
>The system spends time managing threads that could be better spent doing useful work.
What do they think the async runtime in their language is doing? It's literally doing the same thing the kernel would be doing. There's nothing that intrinsically makes scheduling 10k couroutines in userspace more efficient than the kernel scheduling 10k threads. Context switches are really only expensive when the switch is happening between different processes, the overhead of a context switch on a CPU between two threads in the same process is very small (and they're not free when done in userspace anyway).
There are advantages to doing scheduling in the kernel and there are advantages to doing scheduling in userspace, but this article doesn't really touch on any of the actual pros and cons here, it just assumes that userspace scheduling is automatically more efficient.
1 megabyte stacks mean ten thousand threads require 10 gigabytes of RAM just for the stacks. The entire point of the asynchronous programming paradigm is to reclaim all of those gigabytes by not allowing stacks to develop at all, by stealthily turning everything into a hidden form of cooperative multitasking instead.
The author doesn't fully justify the assertion but it does have sound basis.
While virtual memory allocation does not require physical allocation, it immediately runs into the kinds of performance problems that huge pages are designed to solve. On modern systems, you can burn up most of your virtual address space via casual indifference to how it maps to physical memory and the TLB space it consumes. Spinning up thousands of stacks is kind of a pathological case here.
10µs is an eternity for high-performance software architectures. That is also around the same order of magnitude as disk access with modern NVMe. An enormous amount of effort goes into avoiding blocking on NVMe disk access with that latency for good reason. 10µs is not remotely below the noise floor in terms of performance.
> Why is reserving a megabyte of stack space "expensive"?
Guess it's not a huge issue in these 64-bit days, but back in the 32-bit days it was a real limitation to how many threads you could spin up due to the limited address space.
Of course most applications which hit this would override the 1MB default.
There's much ridiculous hatred for OS threads based on people's biases of operating systems and hardware from 20 years ago.
So much so that they'll sign themselves up for async frameworks that thread steal at will and bounce things all over cores causing cache line bouncing and associated memory stalls, not understanding what this is doing to their performance profile.
And endure complexity, etc. through awkward async call chains and function colouring.
Most people's applications would be totally fine just spawning OS threads and using them without fear and dropping into a futex when waiting on I/O; or using the kernel's own async completion frameworks. The OS scheduler is highly efficient, and it is very good at managing multiple cores and even being aware asymmetrical CPU hierarchies, etc.. Likely more efficient than half the async runtimes out there.
> Why is reserving a megabyte of stack space "expensive"?
Equally, if a megabyte of stack is a lot for your usecase, can't you just ask pthreads to reserve less? I believe it goes down to like 16k
> Why is reserving a megabyte of stack space "expensive"?
Because if you use one thread for each of your 10,000 idle sockets you will use 10GB to do nothing.
So you'll want to use a better architecture such as a thread pool.
And if you want your better architecture to be generic and ergonomic, you'll end up with async or green threads.