How many systems are there that can't just spawn a thread for each task they have to work on concurrently? This has to be a system that is A) CPU or memory bound (since async doesn't make disk or network IO faster) and B) must work on ~tens of thousands of tasks concurrently, i.e. can't just queue up tasks and work on only a small number concurrently. The only meaningful example I can come up with are load balancers, embedded software and perhaps something like browsers. But e.g. an application server implementing a REST API that needs to talk to a database anyway to answer each request doesn't really qualify, since the database connection and the work the database itself does are likely much more resource intensive than the overhead of a thread.
Pretty much anything that needs performance and has a lot of relatively light operations is not a candidate for spawning a thread. Context switching and the cost of threads is going to kill performance. A server spawning a thread per request for relatively lightweight request is going to be extremely slow. But sure, if every REST call results in a 10s database query then that's not your bottleneck. A query to a database can be very fast though (due to caches, indices, etc.) so it's not a given that just because you're talking to a database you can just spin up new threads and it'll be fine.
EDIT: Something else to consider is what if your REST calls needs to make 5 queries. Do you serialize them? Now your latency can be worse. Do you launch a thread per query? Now you need to a) synchornize b) take x5 the thread cost. Async patterns or green threads or coroutines enable more efficient overlapping of operations and potentially better concurrency (though a server that handles lots of concurrent requests may already have "enough" concurrency anyways).
Async does make nvme io faster because queueing multiple operations on the nvme itself is faster.
I agree: fork is fast, cheap and easy. If you're spawning something for significant work it tends to be in the noise.
Linux kernel uses 8k stacks (TBH, it's been a while), but there's also some copy-on-write overhead. Still, this is not the C10k problem...
I think it's another case of the whole industry being driven by the needs of the very small number of systems that need to handle >10k concurrent requests.
I'm not sure this is correct mental model of what async solves
Async precisely improves disk/network I/O-bound applications because synchronous code has to waste a whole thread sitting around waiting for an I/O response (each with its own stack memory and scheduler overhead), and in something like an application server there will be many incoming requests doing so in parallel. Cancellation is also easier with async
CPU-bound code would not benefit because the CPU is already busy, and async adds overhead
See e.g. https://learn.microsoft.com/en-us/aspnet/web-forms/overview/... and https://learn.microsoft.com/en-us/aspnet/web-forms/overview/...