Unified Memory
Overview
Time: 45 min
Learn about Unified Memory in CUDA.
Understand how Unified Memory simplifies memory management in CUDA applications.
Unified Memory is a memory management feature in CUDA (CUDA 6 and above) that allows developers to write applications without worrying about the complexities of managing memory between the host (CPU) and device (GPU). It provides a single address space for both the host and device, enabling seamless data sharing and access.
In Unified Memory we use cudaMallocManaged to allocate memory that is accessible from both the host and device.
int *data;
cudaMallocManaged(&data, size * sizeof(int));

// Use data on the host
for (int i = 0; i < size; i++) {
    data[i] = i;
}

// Use data on the device (kernel, blocks, and threads are defined elsewhere)
kernel<<<blocks, threads>>>(data);

// Synchronize so the kernel has finished before the host touches data again
cudaDeviceSynchronize();

// Release the managed allocation
cudaFree(data);
Under the hood, Unified Memory automatically migrates data between the host and device as needed. When the host accesses data that currently resides on the device, the data is transferred to host memory, and vice versa. The CUDA runtime and driver handle this migration transparently; on GPUs that support on-demand paging (Pascal and newer), an access to a non-resident page raises a page fault and only the affected pages are moved.
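The round trip described above can be sketched as a small self-checking helper. This is a minimal sketch, not a definitive implementation: the names `add_one` and `run_managed_increment` are hypothetical and chosen for illustration only. The host writes through the managed pointer, a kernel updates the same pointer, and the host reads the result back without any explicit copy.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to each of the n elements.
__global__ void add_one(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {  // Guard against threads past the end of the array
        data[idx] += 1;
    }
}

// Hypothetical helper: returns the number of wrong elements (0 on success),
// or -1 if a CUDA call fails.
int run_managed_increment(int n) {
    int *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) {
        return -1;
    }

    // Write on the host through the managed pointer.
    for (int i = 0; i < n; i++) {
        data[i] = i;
    }

    // The kernel uses the same pointer; no explicit copy is issued.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_one<<<blocks, threads>>>(data, n);

    // Wait for the kernel before the host touches the data again.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return -1;
    }

    // Read the results on the host, again through the same pointer.
    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] != i + 1) {
            mismatches++;
        }
    }

    cudaFree(data);
    return mismatches;
}
```

Calling `run_managed_increment(1024)` from `main` should return 0 when the kernel ran and every element was incremented exactly once.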
The advantages of Unified Memory are:
Simplifies programming: no need to manage explicit memory transfers
Helps porting code from CPU to GPU
Reduces the complexity of memory management
However, there are some considerations to keep in mind when using Unified Memory:
May cause performance overhead due to page migration
Less fine-grained control over memory movement
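Not all CUDA features are compatible with Unified Memory; for example, the host can access managed data while a kernel is running only on devices that report the concurrentManagedAccess attribute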
Limited support for certain data structures and algorithms
Pinned Memory in Unified Memory
Managed memory is the default kind of Unified Memory allocation, obtained with cudaMallocManaged. The CUDA runtime automatically migrates it between the host and device, so the data is available wherever it is accessed.
Pinned (page-locked) host memory, allocated with cudaMallocHost, is another kind of allocation that both the host and the device can address. Pinning means the operating system cannot page the memory out. Unlike managed memory, pinned memory is not migrated: it stays in host memory, and on systems with unified virtual addressing a kernel can access it directly through the same pointer, with each access crossing the host-device interconnect. Pinned memory can improve performance for certain operations, such as asynchronous data transfers, but it requires more careful management.
// Example of a kernel operating directly on pinned host memory
#include <cuda_runtime.h>

__global__ void kernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {      // Guard against threads past the end of the array
        data[idx] += 1; // Increment each element by 1
    }
}

int main() {
    int size = 1024;
    int *data;

    // Allocate pinned (page-locked) host memory
    cudaMallocHost(&data, size * sizeof(int));

    // Initialize data on the host
    for (int i = 0; i < size; i++) {
        data[i] = i;
    }

    // Launch kernel; the device reads and writes the pinned host buffer directly
    kernel<<<(size + 255) / 256, 256>>>(data, size);

    // Synchronize to ensure all operations are complete
    cudaDeviceSynchronize();

    // Free pinned memory
    cudaFreeHost(data);
    return 0;
}
There are some disadvantages to using pinned memory:
Pinned memory is not pageable, which means it cannot be swapped out to disk by the operating system.
It can consume more system memory, as it is not eligible for paging.
It may lead to reduced system performance if too much pinned memory is used, as it can limit the amount of memory available for other processes.
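The asynchronous-transfer use case mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions: `double_elements`, `run_pinned_async_copy`, and the `CUDA_TRY` macro are hypothetical names that are not part of the lesson or of CUDA. A pinned host buffer is copied to an ordinary device buffer with `cudaMemcpyAsync` on a stream, processed by a kernel on the same stream, and copied back.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper macro: return -1 from the enclosing function on any
// CUDA error. Error paths skip cleanup to keep the sketch short.
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel for illustration: doubles each of the n elements.
__global__ void double_elements(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 2;
    }
}

// Returns the number of wrong elements (0 on success), or -1 on a CUDA error.
int run_pinned_async_copy(int n) {
    size_t bytes = n * sizeof(int);
    int *h_data = nullptr;  // Pinned host buffer
    int *d_data = nullptr;  // Ordinary device buffer
    cudaStream_t stream;

    CUDA_TRY(cudaMallocHost(&h_data, bytes));
    CUDA_TRY(cudaMalloc(&d_data, bytes));
    CUDA_TRY(cudaStreamCreate(&stream));

    for (int i = 0; i < n; i++) {
        h_data[i] = i;
    }

    // Copy, kernel, and copy back are queued on one stream and run in order.
    // The host thread is free to do other work until it synchronizes.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    CUDA_TRY(cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream));
    double_elements<<<blocks, threads, 0, stream>>>(d_data, n);
    CUDA_TRY(cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream));

    // Wait for everything queued on the stream before reading the results.
    CUDA_TRY(cudaStreamSynchronize(stream));

    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        if (h_data[i] != 2 * i) {
            mismatches++;
        }
    }

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return mismatches;
}
```

Compared with letting the kernel dereference the pinned pointer directly, this pattern moves the data once in bulk and lets the kernel work on device memory, which is usually the reason for choosing pinned buffers in the first place.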
cudaMemPrefetchAsync
The cudaMemPrefetchAsync function migrates managed memory to a chosen destination (a GPU or the host) ahead of time.
The call is queued on a stream and returns without blocking the host. This can help improve performance by ensuring
that data is already resident in the desired memory space before it is accessed, instead of being migrated on demand.
// Example of using cudaMemPrefetchAsync
int *data;
cudaMallocManaged(&data, size * sizeof(int));

// Prefetch data to the host
cudaMemPrefetchAsync(data, size * sizeof(int), cudaCpuDeviceId);

// The prefetch is asynchronous; wait for it before relying on it
cudaDeviceSynchronize();

// Initialize data on the host
for (int i = 0; i < size; i++) {
    data[i] = i;
}

// Prefetch data to GPU 0
cudaMemPrefetchAsync(data, size * sizeof(int), 0);

// Launch kernel on the device (same default stream, so it runs after the prefetch)
kernel<<<blocks, threads>>>(data);

// Synchronize to ensure all operations are complete
cudaDeviceSynchronize();
Explanation
cudaCpuDeviceId is a special device ID that refers to the host CPU.
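A slightly more defensive variant of the prefetch pattern is sketched below. It rests on several assumptions. The names `scale_by_three`, `run_prefetch_round_trip`, and `CUDA_TRY` are hypothetical. The code uses the form of `cudaMemPrefetchAsync` that takes an integer destination device, as in the example above; toolkits that have changed this signature would need the call adapted. Prefetching is attempted only when the device reports the `cudaDevAttrConcurrentManagedAccess` attribute, since the call may otherwise fail.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper macro: return -1 from the enclosing function on any
// CUDA error. Error paths skip cleanup to keep the sketch short.
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel for illustration: multiplies each element by 3.
__global__ void scale_by_three(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 3;
    }
}

// Returns the number of wrong elements (0 on success), or -1 on a CUDA error.
int run_prefetch_round_trip(int n) {
    size_t bytes = n * sizeof(int);
    int device = 0;
    CUDA_TRY(cudaGetDevice(&device));

    // Ask whether this device supports concurrent managed access.
    int concurrent = 0;
    CUDA_TRY(cudaDeviceGetAttribute(&concurrent,
                                    cudaDevAttrConcurrentManagedAccess, device));

    int *data = nullptr;
    CUDA_TRY(cudaMallocManaged(&data, bytes));
    for (int i = 0; i < n; i++) {
        data[i] = i;
    }

    // Move the pages to the GPU before the kernel needs them.
    if (concurrent) {
        CUDA_TRY(cudaMemPrefetchAsync(data, bytes, device));
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_by_three<<<blocks, threads>>>(data, n);

    // Move the pages back to the host before the host reads them.
    if (concurrent) {
        CUDA_TRY(cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId));
    }

    // Prefetches and the kernel share the default stream, so they run in order;
    // wait for all of them before touching the data on the host.
    CUDA_TRY(cudaDeviceSynchronize());

    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] != 3 * i) {
            mismatches++;
        }
    }

    cudaFree(data);
    return mismatches;
}
```

The result is the same with or without the prefetch calls; only the timing of the page migration changes, which is why the helper can skip them safely on devices that lack support.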
cudaMemAdvise
The cudaMemAdvise function is used to provide advice to the CUDA runtime about how memory should be managed.
// Example of using cudaMemAdvise
int *data;
cudaMallocManaged(&data, N * sizeof(int));

// Initialize data on host
for (int i = 0; i < N; ++i)
    data[i] = i;

// 1. Advise that data will be mostly read and only occasionally written
//    (the device argument is ignored for this advice)
cudaMemAdvise(data, N * sizeof(int), cudaMemAdviseSetReadMostly, cudaCpuDeviceId);

// 2. Prefer memory to be located on GPU 0
cudaMemAdvise(data, N * sizeof(int), cudaMemAdviseSetPreferredLocation, 0);

// 3. Specify that GPU 0 will access this memory
cudaMemAdvise(data, N * sizeof(int), cudaMemAdviseSetAccessedBy, 0);
The advice values that can be passed to cudaMemAdvise include:
cudaMemAdviseSetReadMostly: Indicates that the memory will mostly be read and only occasionally written, so the runtime may keep read-only copies on the processors that read it.
cudaMemAdviseSetPreferredLocation: Specifies the preferred location for the memory (host or device).
cudaMemAdviseSetAccessedBy: Indicates which device will access the memory, so that it can be kept mapped for that device.
// Example of using cudaMemAdvise with different advices
int *data;
cudaMallocManaged(&data, size * sizeof(int));

// Initialize data on the host before giving advice
for (int i = 0; i < size; i++) {
    data[i] = i;
}

// Advise the CUDA runtime that the data will be mostly read
cudaMemAdvise(data, size * sizeof(int), cudaMemAdviseSetReadMostly, 0);

// Advise the CUDA runtime that the data will be accessed by device 0
cudaMemAdvise(data, size * sizeof(int), cudaMemAdviseSetAccessedBy, 0);

// Launch kernel on the device
kernel<<<blocks, threads>>>(data);

// Synchronize to ensure all operations are complete
cudaDeviceSynchronize();
Key Points
Unified Memory provides a single address space for both host and device memory.
It simplifies memory management by automatically migrating data between host and device.
Use cudaMallocManaged to allocate Unified Memory.
Be aware of potential performance overhead due to automatic page migration.
Not all CUDA features are compatible with Unified Memory, so check compatibility when using it.
Pinned memory can improve performance for certain operations but requires careful management.
Use cudaMemPrefetchAsync to prefetch data between host and device in a non-blocking manner.
Use cudaMemAdvise to provide advice to the CUDA runtime about memory management.