Dynamic Parallelism in CUDA

Overview

Tutorial: 20 min

Understand the concept of dynamic parallelism in CUDA.

Learn how to implement nested kernel launches on the GPU.

Explore the advantages and use cases of dynamic parallelism.

Dynamic parallelism allows a CUDA kernel running on the GPU to launch additional kernels from the device itself, without needing to return control to the CPU. This enables nested parallelism directly on the GPU, allowing for more flexible and adaptive computation models.

Advantages of Dynamic Parallelism:

GPU Can Launch Work Without CPU Involvement
- Traditional CUDA: All kernel launches must come from the host (CPU).
- Dynamic Parallelism: GPU can launch additional kernels internally.
- Result: Eliminates the need to return to the CPU to schedule new GPU work, saving time and reducing latency.
Better for Irregular or Recursive Workloads

Dynamic parallelism is especially useful when:

The amount of work is not known in advance.

Computation patterns depend on data-dependent branching.

Examples include:

Graph traversal

Tree-based algorithms

Adaptive mesh refinement

Sparse linear algebra

N-body simulations

More Natural Expression of Recursive or Hierarchical Algorithms

Many algorithms are naturally recursive or hierarchical in nature:

Quicksort
Depth-first search
Octree or quadtree traversal

Dynamic parallelism allows you to implement these algorithms cleanly, without flattening them into an iterative model.

Reduced CPU-GPU Synchronization Overhead

Avoids unnecessary memory transfers and sync points.
Allows decision logic to remain on the GPU, improving efficiency.

Feature	Benefit
Nested kernel launches	Enables on-demand dynamic parallelism
No CPU sync required	Lowers latency and avoids host-device transfer overhead
Supports irregular workloads	Great for graphs, trees, adaptive data structures
Recursive-friendly	Natural expression of hierarchical logic
More GPU autonomy	Allows more decisions to be made directly on the device

Compilation Requirements

To use dynamic parallelism:

GPU must support compute capability ≥ 3.5.
Use the following NVCC flags:

nvcc -arch=sm_35 -rdc=true -o program program.cu

-rdc=true enables Relocatable Device Code, required for device-side kernel launches.

Code Example

// Child kernel (launched from the device)
__global__ void childKernel(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] += 1;
    }
}

// Parent kernel (launched from the host, and launches child from device)
__global__ void parentKernel(int *data, int n)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Launch child kernel from the device (this is dynamic parallelism!)
    childKernel<<<blocks, threads>>>(data, n);
    // Implicit sync: no cudaDeviceSynchronize() inside a device kernel!
}

int main()
{
    const int N = 1024;
    size_t size = N * sizeof(int);

    int *h_data = (int *)malloc(size);
    for (int i = 0; i < N; ++i) {
        h_data[i] = i;
    }

    int *d_data;
    cudaMalloc(&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // Launch parent kernel
    parentKernel<<<1, 1>>>(d_data, N);


    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 10; ++i) {
        printf("h_data[%d] = %d\n", i, h_data[i]);
    }

    cudaFree(d_data);
    free(h_data);
    return 0;
}

Key Points

Dynamic parallelism allows kernels to launch other kernels from the device.

It is useful for irregular workloads and recursive algorithms.

Requires compute capability ≥ 3.5 and specific NVCC flags.

Reduces CPU-GPU synchronization overhead, improving performance.