GPU Execution Model
Overview
Time: 60 min
Learn the CUDA execution model.
Understand the difference between CUDA kernels and functions.
Learn how to launch CUDA kernels and manage thread indexing.
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It allows developers to use a CUDA-enabled GPU for general-purpose processing, leveraging its massive parallelism capabilities.
CUDA Kernels vs Functions
In CUDA, both kernels and functions are used to define code behavior, but they have distinct purposes, execution models, and qualifiers. CUDA supports three types of functions based on qualifiers:
Qualifier |
Meaning |
|---|---|
|
Defines a kernel function that runs on the device (GPU) and is called from the host (CPU). |
|
Defines a function that runs on the device and can only be called from other device or global functions. |
|
Defines a function that runs on the host (CPU). This is the default for C/C++ functions. |
|
The function can be compiled for and called from both host and device. Often used for small inline functions or utilities. |
CUDA Kernel
A kernel is a special type of function that is executed in parallel by many GPU threads. Kernels are defined using the __global__ keyword and are launched
using triple angle brackets syntax.
__global__ void add(int *a, int *b, int *c) {
int i = threadIdx.x;
c[i] = a[i] + b[i];
}
// Launch kernel with 1 block of 256 threads
add<<<1, 256>>>(d_a, d_b, d_c);
CUDA Device Function
Device functions run on the GPU but are called only from other device or global functions. They are useful for modularizing GPU code.
__device__ int add(int x, int y) {
return x + y;
}
Host Function
Host functions are standard C/C++ functions that execute on the CPU. They typically manage memory and kernel launches.
void initArrays(int *a, int *b) {
for (int i = 0; i < N; ++i) {
a[i] = i;
b[i] = i * 2;
}
}
Feature |
CUDA Kernel ( |
CUDA Function ( |
|---|---|---|
Purpose |
Entry point for GPU parallel execution |
General subroutine for computation |
Runs On |
GPU |
CPU or GPU |
Called From |
Host (CPU) |
Device or Host |
Invocation |
|
Standard function call |
Return Type |
Must be |
Can return values |
Parallelism |
Runs across many GPU threads |
Runs per thread (device) or serially (host) |
Use Case |
Massively parallel tasks |
Helper or utility routines |
Execution Model
In CUDA, parallelism is achieved through a hierarchy of grids, blocks, and threads. When launching a kernel, you specify the grid and block configuration using the syntax:
kernel_name<<<numBlocks, threadsPerBlock>>>(args);
The execution model of CUDA is based on a hierarchy of threads organized into blocks and grids. This model allows for massive parallelism by executing many threads concurrently on the GPU. The basic structure is as follows:
Level |
Description |
Example |
|---|---|---|
Grid |
A grid is a collection of blocks that execute a kernel. |
A grid can be 1D, 2D, or 3D. |
Block |
A block is a group of threads that execute together and can share memory. |
Blocks can also be 1D, 2D, or 3D. |
Thread |
The smallest unit of execution. Each thread executes the same kernel code but operates on different data. |
Threads are identified by their unique thread index within a block. |
Example (1D Launch):
add<<<4, 256>>>(a, b, c);
In the example above, the kernel add() is launched with 1-D grid which has 4 blocks, each block containing 256 threads. This launches a kernel with:
4 blocks in the grid
256 threads in each block
Total threads = 4 × 256 = 1024 threads
Each performing the add() operation.
The same kernel can be launched with different configurations, such as 2D or 3D grids and blocks, to suit the problem’s dimensionality. For example, a 2D grid with 2D blocks might look like this:
kernel_name<<<dim3(2, 2), dim3(16, 16)>>>(args);
This launches a kernel with: * 2 blocks in the grid (2D) * Each block has 16 × 16 = 256 threads * Total threads = 4 × 256 = 1024 threads
Similarly we can launch a 3D kernel:
kernel_name<<<dim3(2, 2, 2), dim3(4, 4, 4)>>>(args);
This launches a kernel with:
2 × 2 × 2 = 8 blocks in the grid (3D)
Each block has 4 × 4 × 4 = 64 threads
Total threads = 8 × 64 = 512 threads
Explanation
dim3 is a built-in C++ struct used to define the dimensions of a grid or a block when launching a kernel.
It’s essentially a convenience structure that stores 3D dimensions (x, y, z), where:
x is required
y and z default to 1 if not specified
Thread and Block Indexing
Each thread is aware of its position in the block and the grid using built-in variables:
Variable |
Meaning |
|---|---|
|
Thread’s index within the block |
|
Block’s index within the grid |
|
Number of threads per block in each dimension |
|
Number of blocks in the grid in each dimension |
Computing Global Thread Index
In most cases when you are working with CUDA, you will need to compute a unique global thread index that identifies each thread across the entire grid. This is essential for accessing global memory or performing operations that require a unique identifier for each thread.
To compute the global thread index, you can use the following formula:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
This formula combines the block index (blockIdx.x) and the thread index within the block (threadIdx.x) to give a unique identifier for each thread across the entire grid. This global index allows each thread to operate on a unique portion of data, enabling parallel processing of large datasets. While the example above is for a 1D grid and block, the same concept applies to 2D and 3D configurations.
To compute the global thread index in 2D you can extend the formula as follows:
int idx = (blockIdx.x * blockDim.x + threadIdx.x) +
(blockIdx.y * blockDim.y + threadIdx.y) * gridDim.x * blockDim.x;
This formula accounts for both dimensions, allowing each thread to have a unique index in a 2D grid.
CUDA Thread Indexing
To compute the global thread index in 3D, you can further extend it:
int idx = (blockIdx.x * blockDim.x + threadIdx.x) +
(blockIdx.y * blockDim.y + threadIdx.y) * gridDim.x * blockDim.x +
(blockIdx.z * blockDim.z + threadIdx.z) * gridDim.x * gridDim.y * blockDim.x * blockDim.y;
This formula accounts for all three dimensions, ensuring that each thread has a unique index in a 3D grid.
Dimension |
Example Code |
|---|---|
1D |
|
2D |
|
3D |
|
This indexing allows each thread to access its unique portion of data in global memory, enabling efficient parallel processing.
Why Global Thread Index Matters
In CUDA, threads run the same code (SIMT model) - which stands for Single Instruction, Multiple Threads. This means that all threads in a block execute the same instruction at the same time, but they can operate on different data.
So without a way to uniquely identify itself, every thread would operate on the same memory location. The global thread index helps distribute work across threads so they operate on independent data.
Suppose we want to add two arrays a and b of size n, and store the result in c. The CUDA kernel would look like this:
__global__ void add(int *a, int *b, int *c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x; // Compute global thread index
if (idx < n) { // Ensure we don't go out of bounds
c[idx] = a[idx] + b[idx]; // Perform addition
}
}
Kernel Launch for the Example would look like this:
int n = 1000;
int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
add_arrays<<<numBlocks, threadsPerBlock>>>(a, b, c, n);
Important
When the number of CUDA threads exceeds the number of array elements, extra threads are launched, but only those whose global index is within bounds should perform useful work. The others must be guarded with a boundary check to avoid out-of-bounds memory access.
Important
All recent GPUs (V100, A100, H100) allow up to 1024 threads per block.
In addition, there are some limits on the dimensions of the grid and blocks.
Block dimension limits:
blockDim.x ≤ 1024
blockDim.y ≤ 1024
blockDim.z ≤ 64
Grid dimension limits:
gridDim.x ≤ 2^31 - 1
gridDim.y ≤ 65535
gridDim.z ≤ 65535
We launch enough threads to cover all elements in the array.
Each thread calculates a unique
idx.Thread 0 processes index 0, thread 1 processes index 1, …, thread 999 processes index 999.
Threads with index >=
nare skipped via the boundary check.
Key Points
CUDA kernels are launched with a grid of blocks, each containing threads.
Each thread has a unique global index computed from its block and thread indices.
The execution model allows for massive parallelism by executing many threads concurrently.
Understanding the execution model is crucial for writing efficient CUDA code.