
Hacker News · Feb 17, 2026 · Collected from RSS
Article URL: https://www.vectorware.com/blog/async-await-on-gpu/
Comments URL: https://news.ycombinator.com/item?id=47049628
Points: 37 | Comments: 2
VectorWare Dispatches · February 17, 2026 · 15 min read

GPU code can now use Rust's async/await. We share the reasons why and what this unlocks for GPU programming.

At VectorWare, we are building the first GPU-native software company. Today, we are excited to announce that we can successfully use Rust's Future trait and async/await on the GPU. This milestone marks a significant step towards our vision of enabling developers to write complex, high-performance applications that leverage the full power of GPU hardware using familiar Rust abstractions.

## Concurrent programming on the GPU

GPU programming traditionally focuses on data parallelism. A developer writes a single operation and the GPU runs that operation in parallel across different parts of the data.

```
fn conceptual_gpu_kernel(data) {
    // All threads in all warps do the same thing to different parts of data
    data[thread_id] = data[thread_id] * 2;
}
```

This model works well for standalone and uniform tasks such as graphics rendering, matrix multiplication, and image processing. As GPU programs grow more sophisticated, developers use warp specialization to introduce more complex control flow and dynamic behavior. With warp specialization, different parts of the GPU run different parts of the program concurrently.

```
fn conceptual_gpu_kernel(data) {
    let communication = ...;
    if warp == 0 {
        // Have warp 0 load data from main memory
        load(data, communication);
    } else if warp == 1 {
        // Have warp 1 compute A on loaded data and forward it to B
        compute_A(communication);
    } else {
        // Have warps 2 and 3 compute B on loaded data and store it
        compute_B(communication, data);
    }
}
```

Warp specialization shifts GPU logic from uniform data parallelism to explicit task-based parallelism. This enables more sophisticated programs that make better use of the hardware. For example, one warp can load data from memory while another performs computations, improving utilization of both compute and memory.

This added expressiveness comes at a cost. Developers must manually manage concurrency and synchronization because there is no language or runtime support for doing so. As with threading and synchronization on the CPU, this is error-prone and difficult to reason about.

## Better concurrent programming on the GPU

There are many projects that aim to provide the benefits of warp specialization without the pain of manual concurrency and synchronization.

JAX models GPU programs as computation graphs that encode dependencies between operations. The JAX compiler analyzes this graph to determine ordering, parallelism, and placement before generating the program that executes. This allows JAX to manage and optimize execution while presenting a high-level programming model in a Python-based DSL. The same model supports multiple hardware backends, including CPUs and TPUs, without changing user code.

Triton expresses computation in terms of blocks that execute independently on the GPU. Like JAX, Triton uses a Python-based DSL to define how these blocks should execute. The Triton compiler lowers block definitions through a multi-level pipeline of MLIR dialects, where it applies block-level data-flow analysis to manage and optimize the generated program.

More recently, NVIDIA introduced CUDA Tile. Like Triton, CUDA Tile organizes computation around blocks. It additionally introduces "tiles" as first-class units of data. Tiles make data dependencies explicit rather than inferred, which improves both performance opportunities and reasoning about correctness.
CUDA Tile ingests code written in existing languages such as Python, lowers it to an MLIR dialect called Tile IR, and executes on the GPU.

We are excited and inspired by these efforts, especially CUDA Tile. We think it is a great idea to have GPU programs structured around explicit units of work and data, separating the definition of concurrency from its execution. We believe that GPU hardware aligns naturally with structured concurrency and that changing the software to match will enable safer and more performant code.

## The downsides of current approaches

These higher-level approaches to GPU programming require developers to structure code in new and specific ways. This can make them a poor fit for some classes of applications. Additionally, a new programming paradigm and ecosystem is a significant barrier to adoption. Developers use JAX and Triton primarily for machine learning workloads, where they align well with the underlying computation. CUDA Tile is newer and more general but has yet to see broad adoption. Virtually no one writes their entire application with these technologies. Instead, they write parts of their application in these frameworks and other parts in more traditional languages and models.

Code reuse is also limited. Existing CPU libraries assume a conventional language runtime and execution model and cannot be reused directly. Existing GPU libraries rely on manual concurrency management and similarly do not compose with these frameworks.

Ideally, we want an abstraction that captures the benefits of explicit and structured concurrency without requiring a new language or ecosystem. It should compose with existing CPU code and execution models. It should provide fine-grained control when needed, similar to warp specialization. It should also provide ergonomic defaults for the common case.

## Rust's Future trait and async/await

We believe Rust's Future trait and async/await provide such an abstraction. They encode structured concurrency directly in an existing language without committing to a specific execution model. A future represents a computation that may not be complete yet. A future does not specify whether it runs on a thread, a core, a block, a tile, or a warp. It does not care about the hardware or operating system it runs on.

The Future trait itself is intentionally minimal. Its core operation is poll, which returns either Ready or Pending. Everything else is layered on top. This separation is what allows the same async code to be driven in different environments. For more detail, see the Rust async book.

Like JAX's computation graphs, futures are deferred and composable. Developers construct programs as values before executing them. This allows the compiler to analyze dependencies and composition ahead of execution while preserving the shape of user code.

Like Triton's blocks, futures naturally express independent units of concurrency. How futures are combined determines whether a block of work runs serially or in parallel. Developers express concurrency using normal Rust control flow, trait implementations, and future combinators rather than a separate DSL.

Like CUDA Tile's explicit tiles and data dependencies, Rust's ownership model makes data constraints explicit in the program structure. Futures capture the data they operate on, and that captured state becomes part of the compiler-generated state machine.
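To make that concrete, here is a minimal hand-written future. The `DoubleLater` type and its CPU `main` driver are purely illustrative and are not part of the GPU code discussed in this post: the future captures its input, returns Pending on its first poll, and completes on the next, which is roughly the shape of state machine the compiler generates for an async fn with an await point. On the CPU it can be driven with the futures crate's block_on.

```rust
use core::future::Future;
use core::pin::Pin;
use core::task::{Context, Poll};

// A hypothetical hand-written future (`DoubleLater` is our name for this
// sketch). It owns its captured data and a tiny amount of state.
struct DoubleLater {
    x: i32,
    polled_once: bool, // the entire "state" of this small state machine
}

impl Future for DoubleLater {
    type Output = i32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.polled_once {
            // Second poll: the work is "done", hand back the result.
            Poll::Ready(self.x * 2)
        } else {
            self.polled_once = true;
            // Ask to be polled again. A real future would instead hand the
            // waker to whatever event it is waiting on.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

fn main() {
    // Constructing the future runs nothing: it is just a value holding its
    // captured state. Only polling by an executor drives it forward.
    let fut = DoubleLater { x: 21, polled_once: false };
    assert_eq!(futures::executor::block_on(fut), 42);
}
```

Nothing in the future itself mentions threads, warps, or an operating system; it is just data plus a poll method, which is why the same kind of state machine can be driven by a GPU-side executor.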
Ownership, borrowing, Pin, and bounds such as Send and Sync encode how data can be shared and transferred between concurrent units of work.

Warp specialization is not typically described this way, but in effect it reduces to manually written task state machines. Futures compile down to state machines that the Rust compiler generates and manages automatically. Because Rust's futures are just compiler-generated state machines, there is no reason they cannot run on the GPU. That is exactly what we have done.

## A world first: async/await running on the GPU

Running async/await on the GPU is difficult to demonstrate visually because the code looks and runs like ordinary Rust. By design, the same syntax used on the CPU runs unchanged on the GPU. Here we define a small set of async functions and invoke them from a single GPU kernel using block_on. Together, they exercise the core features of Rust's async model: simple futures, chained futures, conditionals, multi-step workflows, async blocks, and third-party combinators.

```rust
// Simple async functions that we will call from the GPU kernel below.
async fn async_double(x: i32) -> i32 {
    x * 2
}

async fn async_add_then_double(a: i32, b: i32) -> i32 {
    let sum = a + b;
    async_double(sum).await
}

async fn async_conditional(x: i32, do_double: bool) -> i32 {
    if do_double {
        async_double(x).await
    } else {
        x
    }
}

async fn async_multi_step(x: i32) -> i32 {
    let step1 = async_double(x).await;
    let step2 = async_double(step1).await;
    step2
}

#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn demo_async(val: i32, flag: u8) {
    // `block_on` drives a future to completion on the device; executors are
    // discussed in the next section.

    // Basic async functions with a single await execute correctly on the device.
    let doubled = block_on(async_double(val));

    // Chaining multiple async calls works as expected.
    let chained = block_on(async_add_then_double(val, doubled));

    // Conditionals inside async code are supported.
    let conditional = block_on(async_conditional(val, flag != 0));

    // Async functions with multiple await points also work.
    let multi_step = block_on(async_multi_step(val));

    // Async blocks work and compose naturally.
    let from_block = block_on(async {
        let doubled_a = async_double(val).await;
        let doubled_b = async_double(chained).await;
        doubled_a.wrapping_add(doubled_b)
    });

    // CPU-based async utilities also work. Here we use combinators from the
    // `futures_util` crate to build and compose futures without writing new
    // async functions.
    use futures_util::future::ready;
    use futures_util::FutureExt;
    let from_combinator = block_on(
        ready(val).then(move |v| ready(v.wrapping_mul(2).wrapping_add(100))),
    );
}
```

Getting this all working required fixing bugs and closing gaps across multiple compiler backends. We also encountered issues in NVIDIA's ptxas tool, which we reported and worked around.

## Executors on the GPU

Using async/await makes it ergonomic to express concurrency on the GPU. However, in Rust, futures do not execute themselves and must be driven to completion by an executor. Rust deliberately does not include a built-in executor; instead, third parties provide executors with different features.
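For a sense of how little an executor needs, here is a minimal busy-polling block_on sketch. It is not VectorWare's GPU executor, just the common pattern for environments without an OS: pin the future, then poll it in a loop with a no-op waker until it is Ready. Everything except `main` uses only `core`.

```rust
use core::future::Future;
use core::pin::pin;
use core::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker that does nothing. Because this executor polls in a loop, there is
// no scheduler for wake() to notify.
fn noop_waker() -> Waker {
    unsafe fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(core::ptr::null(), &VTABLE)
    }
    unsafe fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    // SAFETY: every vtable entry is a no-op, so the Waker contract is trivially upheld.
    unsafe { Waker::from_raw(RawWaker::new(core::ptr::null(), &VTABLE)) }
}

// Drive any future to completion by polling it until it is Ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut); // futures must be pinned before they can be polled
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out, // finished: hand back the result
            Poll::Pending => {}             // not finished: spin and poll again
        }
    }
}

fn main() {
    // Works for futures produced by async blocks and async fns alike.
    assert_eq!(block_on(async { 20 + 22 }), 42);
}
```

A production executor would replace the spin loop and no-op waker with scheduling that cooperates with the hardware, but the interface the futures see (poll, Context, Waker) stays the same.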