NVIDIA Research unveils cuTile Rust for safe GPU kernels

cuTile Rust keeps Rust ownership intact across GPU launches, and NVIDIA says it hit about 96% of cuBLAS on a DGX B200.

Jamie Taylor · 2 min read

NVIDIA Research is pushing Rust into one of the hardest places for its safety story to survive: GPU kernels. Its new cuTile Rust work, published June 16, 2026, tries to let systems programmers write high-performance CUDA kernels without dropping the ownership discipline that makes Rust attractive in the first place.

The paper, Fearless Concurrency on the GPU, was submitted to arXiv on June 14, 2026 under identifier 2606.15991. NVIDIA lists Melih Elibol, Jared Roesch, Isaac Gelado, Eric Buehler, and Michael Garland as the authors. The core idea is simple but consequential: mutable outputs are split into disjoint pieces before launch, immutable tensors can be shared, and ownership returns to the host when GPU work completes. That means the launch boundary is no longer treated as a place where Rust’s guarantees end.
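The ownership pattern described above can be illustrated on the CPU with nothing but the Rust standard library. The sketch below is an analogy, not cuTile Rust code: it uses none of the project's API, and `tiled_add` and its `tile` parameter are hypothetical names chosen for illustration. Scoped threads stand in for GPU workers, `chunks_mut` stands in for splitting a mutable output into disjoint pieces, and the end of the scope stands in for the point where ownership returns to the host.

```rust
use std::thread;

/// CPU stand-in for a tiled element-wise kernel: out[i] = a[i] + b[i].
/// `tile` is a hypothetical tile size, not a cuTile Rust parameter.
fn tiled_add(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be non-zero");
    assert_eq!(a.len(), out.len());
    assert_eq!(b.len(), out.len());

    // `thread::scope` plays the role of the launch: it borrows the buffers
    // and does not return until every worker has finished.
    thread::scope(|s| {
        // `chunks_mut` partitions the output into disjoint mutable tiles,
        // so no two workers can ever write the same element.
        for (i, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = i * tile;
            // Immutable inputs are shared freely across all workers.
            let a_tile = &a[start..start + out_tile.len()];
            let b_tile = &b[start..start + out_tile.len()];
            s.spawn(move || {
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = *x + *y;
                }
            });
        }
    });
    // The scope has ended, so the exclusive borrow of `out` is released and
    // the caller can read fully written data again.
}
```

Calling `tiled_add(&a, &b, &mut out, 2)` on five-element inputs produces three tiles of sizes 2, 2, and 1. Handing the same output tile to two workers is rejected at compile time, which is the kind of guarantee the paper aims to carry across a real GPU launch. One thread per tile is used here only to keep the sketch short.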

cuTile Rust is also built for real orchestration, not just a single kernel demo. NVIDIA says the system supports synchronous launches, asynchronous pipelines, and CUDA graph replay, which matters for workloads that have to overlap transfers, scheduling, and execution. The project page describes it as a high-performance GPU programming library that compiles Rust code directly to CUDA kernels, while the repository calls it an early-stage research project with expected bugs, incomplete features, and API breakage. Even so, the stack is already unusually complete: a safe user-facing DSL, a safe host-side API for asynchronously executed kernel functions, and an MLIR-based compiler pipeline backed by CUDA Tile compiler technology.

The performance numbers are hard to dismiss. NVIDIA reports roughly 7 TB/s for element-wise operations and about 2 PFlop/s for GEMM on a DGX B200, with the system reaching around 96% of cuBLAS performance in the reported setup. For a safety-first Rust toolchain, that is the key signal: the ownership model is not obviously costing the kind of throughput GPU teams demand.

The broader context makes the move even more interesting. Rust already has an nvptx64-nvidia-cuda target for compiling to PTX for NVIDIA accelerators, and Rust 1.97, scheduled for July 9, 2026, will raise the baseline PTX ISA to 7.0 and the baseline GPU architecture to sm_70. NVIDIA’s CUDA Tile developer page describes CUDA Tile as a tile-based GPU programming model for Tensor Cores, which helps explain why cuTile Rust fits the direction of the stack. Michael Garland, who joined NVIDIA in 2006 and now leads programming systems and applications research, and Melih Elibol, a senior research scientist in that group, appear to be betting that Rust can move from being the host language around GPU work to a language that can describe GPU work itself. That is the real shift here: Rust is no longer just wrapping CUDA; it is starting to shape how GPU computation is written while the safety model still holds.

This article was produced by Prism’s automated news system from verified source data, official records, and press releases, then run through automated quality and moderation checks before publishing. The system is built and supervised by the people who set the standards it runs under.
