I'm quite interested in how they dealt with Rust's memory model, which might not map neatly onto CUDA's semantics. Curious what the differences are compared to CUDA C++, and whether Rust's type system can actually bring more safety to CUDA. (I do think writing GPU kernels is inherently unsafe; it's just too hard to create a safe language given how the hardware works, and given that you're hyper-optimizing all the time.)
I wonder what it means for Slang[0]. Presumably the point is that people want to do GPU programming with a more modern language. But now you can just use Rust...
(Disclaimer: I like Slang a lot.)
Re: Rust (and "safe" programming languages).
Does anyone have more details on NVIDIA's use of SPARK/Ada?
All I can find is what's listed below:
https://www.adacore.com/case-studies/nvidia-adoption-of-spar...
> directly to PTX
Weird. There's a recent NVIDIA MLIR that is quite good and fast. Or they could target the even easier and more recent/fashionable tile IR [1] used by CuTile [2] (a little higher level but significantly easier to target; it only loses on epilogue fusion and the like).
> (em dash) no DSLs, no foreign language bindings, just Rust.
An official CUDA port, and they couldn't even be bothered to write the introductory paragraph.
Okay, I'll try to ignore it and read the docs. Hey, a custom IR, this sounds interesti-
> MLIR’s implementation, however, is C++ with a side of TableGen, a build system that requires you to compile all of LLVM, and debugging sessions that make you question your career choices.
I can't take this industry seriously anymore.
TileLang https://github.com/tile-ai/tilelang and stuff like Tile Kernels https://github.com/deepseek-ai/TileKernels will make CUDA obsolete one day.
Does anyone know if this will let you share structs between host and device? That's the big thing missing so far from existing Rust/CUDA workflows (plus the serialization/bytes barrier between them).
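Not cuda-oxide-specific, but for anyone wondering what the workaround looks like today: give the struct a stable #[repr(C)] layout and view it as raw bytes with bytemuck for the copy. A minimal sketch; the Params struct and its fields are made up for illustration.

    // A #[repr(C)] struct whose layout both the Rust host and a CUDA C kernel
    // can agree on. bytemuck (with its "derive" feature) reinterprets it as
    // bytes for the host-to-device copy.
    use bytemuck::{Pod, Zeroable};

    #[repr(C)]
    #[derive(Clone, Copy, Pod, Zeroable)]
    struct Params {
        scale: f32,
        offset: f32,
        len: u32,
        _pad: u32, // explicit padding keeps the layout identical on both sides
    }

    fn main() {
        let p = Params { scale: 2.0, offset: 0.5, len: 1024, _pad: 0 };
        // Crossing the host/device boundary today means going through bytes:
        let bytes: &[u8] = bytemuck::bytes_of(&p);
        assert_eq!(bytes.len(), core::mem::size_of::<Params>());
        // ...copy `bytes` to the GPU, where a CUDA C `struct Params` with the
        // same field order reads them back. First-class shared structs would
        // make this round trip unnecessary.
    }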
One thing I've been wary about with Rust for CUDA is the bit of overhead Rust adds that is usually negligible but might matter here, like bounds checks on arrays. Could it cause additional registers to be used, lowering a kernel's occupancy?
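For what it's worth, the usual mitigations carry over. A sketch in plain Rust (not any particular GPU crate's API) of the three options for an output buffer and a computed thread index:

    // Three ways the bounds-check cost can play out.
    fn checked(out: &mut [f32], i: usize, v: f32) {
        out[i] = v; // costs a compare + branch (and registers) per access
    }

    fn elided(out: &mut [f32], v: f32) {
        // Iterator form: the compiler knows every index is in range,
        // so no check is emitted.
        for slot in out.iter_mut() {
            *slot = v;
        }
    }

    unsafe fn unchecked(out: &mut [f32], i: usize, v: f32) {
        // Explicit opt-out when the index is proven in range elsewhere.
        *out.get_unchecked_mut(i) = v;
    }

Whether the checked version actually costs registers (and what a panic even lowers to on the device) is exactly the kind of thing worth inspecting in the generated PTX.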
Hell yea! I have been doing this with cudarc (kernels) and FFI (cuFFT), using manual [de]serialization between byte arrays and Rust structs. I hope this makes it lower friction!
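For anyone curious what the FFI half of that looks like: hand-rolled cuFFT bindings are a few extern declarations against cufft.h. A trimmed sketch with error handling reduced to a success check; the constant and signatures follow cufft.h, but double-check against your header version.

    use std::os::raw::c_int;

    // cufftHandle is a plain integer handle in cufft.h.
    type CufftHandle = c_int;
    const CUFFT_C2C: c_int = 0x29; // complex-to-complex transform type

    #[link(name = "cufft")]
    extern "C" {
        fn cufftPlan1d(plan: *mut CufftHandle, nx: c_int, kind: c_int, batch: c_int) -> c_int;
        fn cufftDestroy(plan: CufftHandle) -> c_int;
    }

    fn make_plan(nx: c_int) -> Option<CufftHandle> {
        let mut plan: CufftHandle = 0;
        // CUFFT_SUCCESS is 0; anything else is an error code.
        let rc = unsafe { cufftPlan1d(&mut plan, nx, CUFFT_C2C, 1) };
        (rc == 0).then_some(plan)
    }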
https://nvlabs.github.io/cuda-oxide/gpu-safety/the-safety-mo...
> A GPU kernel runs thousands of threads that all see the same memory at the same time. On a CPU, Rust prevents data races through ownership and borrowing – one mutable reference, no aliases, enforced at compile time. On a GPU, you have 2048 threads per SM, all launched from the same function, all pointing at the same output buffer. The borrow checker was not designed for this.
> cuda-oxide solves the problem in layers. The common case – one thread writes one element – is safe by construction, no unsafe required. The uncommon cases – shared memory, warp shuffles, hardware intrinsics – require unsafe with documented contracts. And the frontier cases – TMA, tensor cores, cluster-level communication – are fully manual, matching the complexity of the hardware they control.
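To picture the common case they describe: each thread derives a unique index and owns exactly one output element, so no two threads alias. A hypothetical sketch in plain Rust; the intrinsic names are stand-ins, not cuda-oxide's actual API.

    // Host-side stubs so the sketch type-checks; on the device these would be
    // hardware intrinsics giving each thread its coordinates.
    fn thread_idx_x() -> usize { 0 }
    fn block_idx_x() -> usize { 0 }
    fn block_dim_x() -> usize { 1 }

    // "One thread writes one element": the index -> element map is injective,
    // so a framework can expose this pattern as safe by construction.
    fn saxpy(a: f32, x: &[f32], y: &mut [f32]) {
        let i = block_idx_x() * block_dim_x() + thread_idx_x();
        if i < x.len().min(y.len()) {
            y[i] = a * x[i] + y[i]; // thread i touches only element i
        }
    }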
That's... not really Rusty. In Rust, we create new safe abstractions when the existing ones don't quite map to the problem at hand. See, for example, what's done in Rust for Linux.
If it's not safe... what's the point of Rust?
(It's okay to offer unsafe APIs for people who need to squeeze out the last bit of performance, but this shouldn't be the baseline.)
I compare this with userspace libs for APIs like io_uring and Vulkan; designing safe APIs for that stuff is kind of hard (there have even been some unsound attempts).
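To make the "build a safe abstraction" point concrete: on the CPU, the standard-library move is to carve one buffer into disjoint &mut views so each worker provably owns its slice; a GPU launch API could hand each thread the one-element equivalent. A plain-Rust sketch:

    use std::thread;

    // `chunks_mut` proves to the borrow checker that each worker gets a
    // disjoint &mut view of a single buffer: no aliasing by construction.
    fn parallel_fill(buf: &mut [f32], workers: usize) {
        let chunk = buf.len().div_ceil(workers.max(1)).max(1);
        thread::scope(|s| {
            for (w, part) in buf.chunks_mut(chunk).enumerate() {
                s.spawn(move || {
                    for slot in part.iter_mut() {
                        *slot = w as f32;
                    }
                });
            }
        });
    }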
So, we have stainless, which means Linux code that never rusted. Now we need someone to make phosphorus so that we can turn rusty code into old iron. Then GPL fans can run Rust boxes, Stainless machines, or future-proofed Iron workhorses.
All software can come in three editions: Stainless drivers that were never rusty, Oxidized drivers that used Rust on existing code, and Iron editions, where someone converted the Rust back to C using the new phosphoric tool...
Diversity can be our strength.
Making Iron C/C++ code can be called acid washing if it was rusted.
Here's the repo link: https://github.com/NVlabs/cuda-oxide
Oh lord. If this is the trend, I probably can't avoid improving my Rust language knowledge in the long term. I hate reading Rust so much right now. I guess I just have to get over that hump.
AWESOME!
Personally, I really don't want new GPU languages that don't have AD (automatic differentiation) as a first-class citizen. I mean, Rust is an improvement over C++ CUDA, but still.
This is good for Rust if you want to use the language with CUDA. The problem is, it still doesn't really move the needle if you don't like running closed-source drivers and runtime binaries and you care about open source.
Continuing from this discussion [0], this only makes it a Rust or CUDA problem, rather than a Python, CUDA, and PyTorch one, if there's a bug in one of them.
Yet at the end of the day, it still uses NVIDIA's closed-source CUDA compiler, nvcc, which they will never open source. At least Mojo promises to open source its own compiler, which compiles to different accelerators with multiple backend support.
Unlike this... though this one uses Rust.
This is already solved by Mojo; they must be rushing something out to compete, since Mojo is at version 1.0beta1.
Why do we bother with programming languages today? Why not have the LLMs just write assembly code and skip the human-readable part? We aren't reviewing it anymore anyway.
This is amazing... I've been working with custom CUDA kernels and https://crates.io/crates/cudarc for a long time, and this honestly looks like it could be a near drop-in replacement.
I'm especially curious how build times would compare. Most Rust CUDA crates obviously rely on calling CMake or nvcc, which can make compilation painfully slow. Coincidentally, just last week I was profiling build times and found that tools like sccache can dramatically reduce rebuild times by caching artifacts, but you still end up paying for expensive custom nvcc invocations (e.g., Candle by Hugging Face calls a custom nvcc command in its kernel compilation): https://arpadvoros.com/posts/2026/05/05/speeding-up-rust-whi...
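For context, the expensive step is usually a build.rs that shells out to nvcc for every kernel, roughly like the sketch below (paths and flags illustrative). Because the child process is invoked directly rather than through the RUSTC_WRAPPER, sccache never sees it the way it sees rustc calls.

    // build.rs: the shape of the custom nvcc invocation paid on rebuilds.
    use std::{env, path::PathBuf, process::Command};

    fn main() {
        let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
        let ptx = out_dir.join("kernel.ptx");

        let status = Command::new("nvcc")
            .args(["--ptx", "kernels/kernel.cu", "-o"])
            .arg(&ptx)
            .status()
            .expect("failed to run nvcc");
        assert!(status.success(), "nvcc failed");

        // Rerun only when the kernel source changes; without this, every
        // cargo build repeats the nvcc call even when nothing changed.
        println!("cargo:rerun-if-changed=kernels/kernel.cu");
    }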