Finding Optimal GPU Occupancy
Overview
Tutorial: 20 min
Understand the concept of GPU occupancy.
Learn how to calculate and optimize GPU occupancy.
Use CUDA API functions to determine optimal block sizes and active blocks per multiprocessor.
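GPU occupancy is a measure of how effectively the GPU’s resources are utilized by a kernel. It is usually expressed as the ratio of active warps on a multiprocessor to the maximum number of warps that the multiprocessor supports. High occupancy can lead to better performance, mainly because more resident warps give the scheduler more opportunities to hide memory and instruction latency.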
Important
Occupancy is not the only factor that determines performance; memory bandwidth and instruction throughput also play significant roles.
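As a small worked example of the occupancy ratio, the sketch below uses made-up numbers: a block size of 256 threads, a warp size of 32, and a hypothetical multiprocessor that supports at most 64 resident warps and can hold 4 blocks of this kernel at once. These values are assumptions chosen for illustration, not measurements from a particular GPU.

```latex
\[
\text{occupancy} = \frac{\text{active warps per multiprocessor}}{\text{maximum warps per multiprocessor}}
\]
\[
\text{warps per block} = \frac{256}{32} = 8, \qquad
\text{active warps} = 4 \times 8 = 32, \qquad
\text{occupancy} = \frac{32}{64} = 0.5
\]
```

Under these assumptions the kernel reaches 50% occupancy. If the kernel's register or shared memory usage let 8 blocks be resident instead, the same calculation would give 100%.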
cudaOccupancyMaxPotentialBlockSize()
cudaOccupancyMaxPotentialBlockSize() is a CUDA runtime API function that suggests a launch configuration for a given kernel. It returns two values: the block size (threads per block) that achieves the maximum potential occupancy, and the minimum grid size (number of blocks) needed to reach that occupancy across the whole device.
The function takes into account the kernel’s resource usage, such as registers and shared memory, together with the limits of the current device. Its last two arguments are the dynamic shared memory the kernel will use per block, in bytes, and an upper limit on the block size, where 0 means no limit. The result is a heuristic starting point for choosing a block size and grid configuration for a specific kernel and GPU architecture; it does not guarantee the fastest configuration.
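int minGridSize = 0;  // Receives the minimum grid size
int blockSize = 0;    // Receives the suggested block size
cudaOccupancyMaxPotentialBlockSize(
    &minGridSize,  // Minimum grid size needed to achieve the best potential occupancy
    &blockSize,    // Block size (threads per block)
    vectorAdd2D,   // Kernel function
    0,             // Per-block dynamic shared memory usage intended, in bytes
    0);            // Block size limit; 0 means no limit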
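The call above can be placed in a complete program roughly as follows. This is a minimal sketch, not the lesson's own code: the 1D kernel vectorAdd, the helper gridSizeFor, and the problem size n are hypothetical stand-ins, since the definition of vectorAdd2D is not shown in this section. Note that the returned block size is a total thread count, so a kernel launched with 2D blocks would still need that count split into x and y dimensions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 1D kernel, used only for illustration.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: number of blocks needed to cover n elements.
inline int gridSizeFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;  // ceiling division
}

int main()
{
    const int n = 1 << 20;  // assumed problem size
    int minGridSize = 0;
    int blockSize = 0;

    // Ask the runtime for a block size that maximizes potential occupancy.
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, vectorAdd, 0, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "Occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The grid must cover the data, so it is derived from n, not from minGridSize.
    const int gridSize = gridSizeFor(n, blockSize);
    std::printf("suggested block size: %d, min grid size: %d, grid size used: %d\n",
                blockSize, minGridSize, gridSize);

    float *a = nullptr, *b = nullptr, *c = nullptr;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vectorAdd<<<gridSize, blockSize>>>(a, b, c, n);
    cudaDeviceSynchronize();
    std::printf("c[0] = %f\n", c[0]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

The suggested values depend on the GPU and on how the kernel was compiled, so the printed block size will differ between devices.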
cudaOccupancyMaxActiveBlocksPerMultiprocessor()
cudaOccupancyMaxActiveBlocksPerMultiprocessor() is a CUDA runtime API function that reports how many blocks of a given kernel can be
resident at the same time on one multiprocessor, for a chosen block size and dynamic shared memory usage.
The function takes into account the kernel’s resource usage, such as registers and shared memory, and the limits of the
current device. Multiplying the result by the number of warps per block and dividing by the maximum number of warps
per multiprocessor gives the kernel’s theoretical occupancy for that block size.
int numBlocks = 0;  // Receives the maximum number of active blocks per multiprocessor
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks,   // Number of blocks
    vectorAdd2D,  // Kernel function
    blockSize,    // Block size (threads per block)
    0);           // Per-block dynamic shared memory usage intended, in bytes
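One common use of this query is to turn the block count into a theoretical occupancy figure. The sketch below does so under a few assumptions: the kernel vectorAdd and the helper occupancyRatio are hypothetical names introduced for illustration, the block size of 256 is an arbitrary choice, and device 0 is assumed to be the GPU of interest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 1D kernel, used only for illustration.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: fraction of a multiprocessor's warp slots in use.
inline double occupancyRatio(int activeBlocks, int blockSize,
                             int warpSize, int maxThreadsPerSM)
{
    const int warpsPerBlock = (blockSize + warpSize - 1) / warpSize;
    const int activeWarps = activeBlocks * warpsPerBlock;
    const int maxWarps = maxThreadsPerSM / warpSize;
    return static_cast<double>(activeWarps) / maxWarps;
}

int main()
{
    const int blockSize = 256;  // assumed block size to evaluate
    int numBlocks = 0;

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        std::fprintf(stderr, "Device query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // How many blocks of this kernel fit on one multiprocessor at once?
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, vectorAdd, blockSize, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "Occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const double occ = occupancyRatio(numBlocks, blockSize, prop.warpSize,
                                      prop.maxThreadsPerMultiProcessor);
    std::printf("active blocks per multiprocessor: %d\n", numBlocks);
    std::printf("theoretical occupancy: %.1f%%\n", occ * 100.0);
    return 0;
}
```

This is a theoretical upper bound derived from resource limits. The occupancy actually achieved at run time can be lower and is best checked with a profiler.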
Key Points
GPU occupancy is a measure of how well the GPU’s resources are utilized by a kernel.
High occupancy can lead to better performance, but it is not the only factor.
Use cudaOccupancyMaxPotentialBlockSize() to find a block size, and the matching minimum grid size, that maximizes potential occupancy.
Use cudaOccupancyMaxActiveBlocksPerMultiprocessor() to determine the maximum number of active blocks per multiprocessor.
Consider other factors like memory bandwidth and instruction throughput when optimizing performance.