I'm quite interested in how they dealt with Rust's memory model, which might not neatly map to CUDA's semantics. Curious what the differences are compared to CUDA C++, and if the Rust's type system can actually bring more safety to CUDA (I do think writing GPU kernels is inherently unsafe, it's just too hard to create a safe language because of how the hardware works, and because of the fact that you're hyper-optimizing all the time)
This is explained in some detail in the docs. There is a safe layer, a mostly safe layer, and an unsafe layer. Some clunkiness is needed for safe-yet-parallel work that they couldn’t easily fit into the Rust Send/Sync model.
I think it depends on the objective. My pattern-matching brain says there will be interest in addressing this.
From my perspective of someone who writes applications in Rust and sometimes wants to use GPU compute in these applications: I don't care. If we can leverage the memory model or ownership model in a low-friction way, that's fine. If it makes it a high friction experience, I would prefer not to do it that way.
The baseline is IMO how Cudarc currently does it. I don't think there is much memory management involved; it's just imperative syntax wrapping FFI, and some lines in the build script to invoke nvcc if the kernels change.
FWIW, Rust’s memory model is more or less completely identical to C++’s, by design. Atomics work the same, there’s provenance, and so on.
Whether it is a convenient language for GPU programming probably remains to be seen, but I definitely wouldn’t be surprised if you could make a decent DSL-like API for writing safe code that leverages the full spectrum of GPU oddities. That’s what CUDA is, right?
the main 4 i see are:
1. use-after-free, drop semantics vs manual cudaFree
2. kernel args enforced using `cuda_launch!` whereas CPP void* args is just an array of pointers, validating count only
3. alias mutable writes. e.g. CPP can have more than one thread writing out[i] with same i and this will compile. but DisjointSlice<T> with ThreadIndex doesnt have any public constructor (see: https://github.com/NVlabs/cuda-oxide/blob/2a03dfd9d5f3ecba52...) and only using API of `index_1d` `index_2d` and `index_2d_runtime`
4. im pretty sure you can cuda memcpy a std::string and literally any other POD and "corrupt" its state making it unusable. here it ONLY accepts DisjointSlice<T>, scalars, and closures (https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...)
but most of the nitty gritty is in these sections
* https://nvlabs.github.io/cuda-oxide/gpu-safety/the-safety-mo...
* https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...
edit: that being said, not like this catch everything, just looks to give much more guardrails against UB with raw .cu files