The interesting challenge with async/await on GPU is that it inverts the usual concurrency mental model. CPU async is about waiting efficiently while I/O completes. GPU async is about managing work distribution across warps that are physically executing in parallel. The futures abstraction maps onto that, but the semantics are different enough that you have to be careful not to carry over intuitions from tokio/async-std.
The comparison to NVIDIA's stdexec is worth looking at. stdexec uses a sender/receiver model which is more explicit about the execution context. Rust's Future trait abstracts over that, which is ergonomic but means you're relying on the executor to do the right thing with GPU-specific scheduling constraints.
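For reference, the abstraction in question is just this (the Future trait as it exists in std::future, reproduced for illustration). Nothing in the signature names an execution context; the executor only shows up indirectly through the Context/Waker it passes in, which is the contrast with a sender/receiver style where the scheduler travels explicitly with the work:

    use std::pin::Pin;
    use std::task::{Context, Poll};

    // The trait from std::future, copied here for illustration. The executor
    // is implicit: it supplies the Context (and its Waker), and the signature
    // says nothing about where or how the poll gets scheduled.
    pub trait Future {
        type Output;
        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
    }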
Practically, the biggest win here is probably for the cases shayonj mentioned: mixed compute/memory pipelines where you want one warp loading while another computes. That's exactly where the warp specialization boilerplate becomes painful. If async/await can express that cleanly without runtime overhead, that is a real improvement.
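To make the shape concrete, here's a CPU-side analogy only (it assumes the tokio crate and is not GPU code): two tasks joined by a bounded channel, one staging data while the other computes. On a GPU the two tasks would be specialized warps and the channel a shared-memory staging buffer; the point is just that async/await can express the overlap without a hand-rolled state machine:

    // CPU-side analogy only; assumes the tokio crate. Not GPU code.
    use tokio::sync::mpsc;

    // Stand-in for the "loader" warp: stages tiles asynchronously.
    async fn producer(tx: mpsc::Sender<Vec<f32>>) {
        for chunk_id in 0..4 {
            let tile = vec![chunk_id as f32; 1024]; // pretend global-memory load
            if tx.send(tile).await.is_err() {
                break;
            }
        }
    }

    // Stand-in for the "compute" warp: consumes staged tiles.
    async fn consumer(mut rx: mpsc::Receiver<Vec<f32>>) -> f32 {
        let mut acc = 0.0;
        while let Some(tile) = rx.recv().await {
            acc += tile.iter().sum::<f32>();
        }
        acc
    }

    #[tokio::main]
    async fn main() {
        // Capacity 2 gives double buffering: the next tile can be staged
        // while the current one is still being consumed.
        let (tx, rx) = mpsc::channel(2);
        let (_, total) = tokio::join!(producer(tx), consumer(rx));
        println!("sum = {total}");
    }

The bounded capacity is what makes it a pipeline rather than an unbounded prefetch: the producer naturally stalls when it gets too far ahead of the consumer.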
I had a longer, snarkier response to this, the (as I'm writing) top comment on this thread. I spent longer than I'd like to have spent trying to decode what insight you were sharing here (what exactly is inverted in the GPU/CPU summaries you give?) until I browsed your comment history, saw what looks like a bunch of AI-generated comments (sometimes posted less than a minute apart from each other), and realized I was trying to decode slop.
This one's especially clear because you reference "the cases shayonj mentioned", but shayonj's comment[1] doesn't mention any use cases; it does make a comparison to "NVIDIA's stdexec", which seems like it might have gotten mixed into what your model was trying to say in the preceding paragraph?
This is really annoying. Please stop.
[1] https://news.ycombinator.com/item?id=47050304