Memory Pool

Overview

Tutorial: 30 min

Understand the concept of memory pools in CUDA.

Learn how to create and manage memory pools using the CUDA API.

Explore the benefits of using memory pools for performance optimization.

In CUDA, memory pools are a way to efficiently manage device memory allocations by reducing the overhead of frequent cudaMalloc and cudaFree calls. The memory pool gives you more control over how memory is allocated and reused on the GPU.

Calls to cudaMalloc and cudaFree are expensive: each one goes through the CUDA driver, and cudaFree also synchronizes the device. In workloads with frequent allocations and deallocations this overhead can become a performance bottleneck. A memory pool is a chunk of device memory that is reserved once and then used to serve many smaller allocations. Instead of going to the OS or CUDA driver each time, memory requests are fulfilled from this pool. This allows for:

Faster memory allocation/deallocation
Better memory reuse
Control over memory fragmentation

The main difference in implementation is that:

cudaMallocAsync() replaces cudaMalloc()
cudaFreeAsync() replaces cudaFree()

Both calls take a stream argument, and the allocation or free takes effect in stream order.

Default Memory Pool

The default memory pool in CUDA does not have a fixed size. Instead, it grows and shrinks dynamically based on allocation needs, up to the limits of the available device memory.

The default pool starts empty.
When you call cudaMallocAsync(), the pool requests memory from the system as needed.
It can reuse memory from previous allocations if available.
The pool will continue to grow until there is no more available device memory.
The release threshold does not cap this growth. It only controls how much unused memory the pool keeps reserved when a stream, event, or device is synchronized.

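When memory is freed using cudaFreeAsync(), it is returned to the pool for reuse, rather than being returned to the system immediately. This allows for faster subsequent allocations from the memory pool. Note that the default release threshold is 0, so unused memory is handed back to the system at the next synchronization unless you raise the threshold.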

cudaStream_t stream;
cudaStreamCreate(&stream);

float* d_ptr;
size_t size = 1024 * sizeof(float);

// Asynchronous allocation
cudaMallocAsync((void**)&d_ptr, size, stream);

// Use d_ptr in a kernel...

// Asynchronous deallocation
cudaFreeAsync(d_ptr, stream);

cudaStreamDestroy(stream);
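
To make the "Use d_ptr in a kernel" step concrete, here is a minimal sketch of the full round trip. The kernel scale and the helper scaleOnDevice are hypothetical names introduced only for this illustration, and error handling is reduced to a single success flag.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: multiplies every element by a constant factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: scales a host vector on the GPU using a temporary
// buffer from the default memory pool. Returns false on a CUDA error.
bool scaleOnDevice(std::vector<float>& host, float factor) {
    if (host.empty()) return true;  // nothing to do
    const int n = static_cast<int>(host.size());
    const size_t bytes = host.size() * sizeof(float);

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    float* d_ptr = nullptr;
    bool ok = cudaMallocAsync((void**)&d_ptr, bytes, stream) == cudaSuccess;
    if (ok) {
        // Every operation below is queued on the same stream, so it runs
        // after the allocation and before the free.
        cudaMemcpyAsync(d_ptr, host.data(), bytes, cudaMemcpyHostToDevice, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_ptr, n, factor);
        cudaMemcpyAsync(host.data(), d_ptr, bytes, cudaMemcpyDeviceToHost, stream);
        cudaFreeAsync(d_ptr, stream);
        ok = cudaStreamSynchronize(stream) == cudaSuccess &&
             cudaGetLastError() == cudaSuccess;
    }
    cudaStreamDestroy(stream);
    return ok;
}
```

Because the allocation, the kernel, the copies, and the free all share one stream, no extra synchronization is needed between them. The host only waits once, at cudaStreamSynchronize, before reading the results.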

We can get the attributes of the default memory pool using the following code:

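cudaMemPool_t pool; uint64_t current = 0, high = 0; // statistics are 64-bit values (uint64_t from <cstdint>)
cudaDeviceGetDefaultMemPool(&pool, 0); // default pool of device 0
cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &current);
cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemHigh, &high);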

where cudaMemPoolAttrReservedMemCurrent gives the number of bytes currently reserved by the pool, and cudaMemPoolAttrReservedMemHigh gives the high watermark, that is, the largest amount the pool has had reserved since the watermark was last reset.
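
As a rough sketch of how these statistics might be inspected in practice, the helpers below allocate from the default pool and print the reserved and in-use byte counts. The names poolStat, printPoolStats, and reportDefaultPool are made up for this example. The sketch also reads cudaMemPoolAttrUsedMemCurrent and cudaMemPoolAttrUsedMemHigh, which report the bytes handed out to live allocations.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: reads one 64-bit statistic from a pool (0 on error).
uint64_t poolStat(cudaMemPool_t pool, cudaMemPoolAttr attr) {
    uint64_t value = 0;
    cudaMemPoolGetAttribute(pool, attr, &value);
    return value;
}

// Hypothetical helper: prints reserved and in-use bytes with their peaks.
void printPoolStats(cudaMemPool_t pool, const char* label) {
    printf("%s: reserved=%llu (high %llu), used=%llu (high %llu)\n", label,
           (unsigned long long)poolStat(pool, cudaMemPoolAttrReservedMemCurrent),
           (unsigned long long)poolStat(pool, cudaMemPoolAttrReservedMemHigh),
           (unsigned long long)poolStat(pool, cudaMemPoolAttrUsedMemCurrent),
           (unsigned long long)poolStat(pool, cudaMemPoolAttrUsedMemHigh));
}

// Allocates `bytes` from the default pool of `device`, prints the statistics
// while the allocation is live and again after it is freed, and returns the
// reserved size observed while the allocation was live (0 on error).
uint64_t reportDefaultPool(int device, size_t bytes) {
    cudaMemPool_t pool;
    if (cudaSetDevice(device) != cudaSuccess) return 0;
    if (cudaDeviceGetDefaultMemPool(&pool, device) != cudaSuccess) return 0;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 0;

    uint64_t reservedWhileLive = 0;
    void* p = nullptr;
    if (cudaMallocAsync(&p, bytes, stream) == cudaSuccess) {
        cudaStreamSynchronize(stream);
        printPoolStats(pool, "while allocated");
        reservedWhileLive = poolStat(pool, cudaMemPoolAttrReservedMemCurrent);
        cudaFreeAsync(p, stream);
        cudaStreamSynchronize(stream);
        printPoolStats(pool, "after free + sync");
    }
    cudaStreamDestroy(stream);
    return reservedWhileLive;
}
```

The exact numbers depend on the driver and on how the pool rounds its reservations, so treat the output as a way to observe trends rather than exact sizes.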

Custom Memory Pool

Custom memory pools allow you to create and manage your own memory allocator instead of relying on the default memory pool. This gives you more control over how and when memory is reserved and reused, which is useful in performance-critical or memory-constrained applications.

Custom memory pools allow you to:

Set allocation limits
Track memory usage independently
Control release thresholds
Isolate subsystems or tasks using separate allocators

Explanation

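The release threshold of a CUDA memory pool is the amount of reserved memory, in bytes, that the pool holds onto before it tries to release unused memory back to the system. When the pool holds more than this amount, the excess unused memory is released at the next stream, event, or device synchronization.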
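
The effect of the release threshold can be observed with a small experiment. The sketch below is illustrative only: reservedAfterFree is a hypothetical helper that sets the threshold on a given pool, allocates and frees a buffer, synchronizes, and then reports how much memory the pool still has reserved.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical helper: applies `threshold` to `pool`, allocates and frees
// `bytes` from it, and returns the bytes still reserved after synchronizing.
uint64_t reservedAfterFree(cudaMemPool_t pool, uint64_t threshold, size_t bytes) {
    // The attribute value is passed by pointer, not by value.
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 0;

    void* p = nullptr;
    if (cudaMallocFromPoolAsync(&p, bytes, pool, stream) == cudaSuccess) {
        cudaFreeAsync(p, stream);
    }
    // Memory above the threshold may be handed back to the system here.
    cudaStreamSynchronize(stream);

    uint64_t reserved = 0;
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reserved);
    cudaStreamDestroy(stream);
    return reserved;
}
```

Calling it with UINT64_MAX as the threshold should leave at least the freed buffer reserved for reuse. With a threshold of 0 the pool is allowed to hand the memory back, so the reported value is typically much smaller.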

Create a memory pool

cudaMemPool_t myPool;
cudaMemPoolProps props = {}; //struct that specifies the properties of the memory pool
props.allocType = cudaMemAllocationTypePinned; // non-migratable memory; with a device location this is ordinary device memory
props.handleTypes = cudaMemHandleTypeNone;
props.location.type = cudaMemLocationTypeDevice;
props.location.id = 0; // device ID

cudaMemPoolCreate(&myPool, &props);

The cudaMemPoolProps structure defines the properties for a custom CUDA memory pool. Below is a detailed explanation of each field:

Field	Description
allocType	Specifies the type of memory to allocate. The usual value is cudaMemAllocationTypePinned, which means non-migratable memory that stays at the location given by location.type and location.id. Despite the name, it does not by itself mean page-locked host memory: with a device location it is ordinary device memory. Other values depend on the CUDA version, so check the cudaMemAllocationType reference for your toolkit.
handleTypes	Specifies how the pool's memory can be shared across processes. Options include: cudaMemHandleTypeNone: No inter-process sharing. cudaMemHandleTypePosixFileDescriptor: Shareable via POSIX file descriptors (Linux). cudaMemHandleTypeWin32: Shareable via Windows handles.
location.type	Indicates where the pool's memory lives. Commonly set to cudaMemLocationTypeDevice, which ties the pool to a specific GPU. Newer CUDA versions also accept host locations such as cudaMemLocationTypeHost (rare).
location.id	Specifies the device or host ID. For device memory pools, this is the GPU ID (e.g., 0 for cudaSetDevice(0)).

Set attributes (optional)

uint64_t threshold = 1024 * 1024; // keep up to 1 MB of unused memory reserved
cudaMemPoolSetAttribute(myPool, cudaMemPoolAttrReleaseThreshold, &threshold);
uint64_t zero = 0;
cudaMemPoolSetAttribute(myPool, cudaMemPoolAttrReservedMemHigh, &zero); // reset the high watermark

cudaMemPoolSetAttribute takes a pointer to the value, not the value itself. Of the attributes used in this lesson, only the release threshold is a real setting. The reserved-memory attributes are statistics: the current value is read-only, and the high watermark can only be reset by writing 0.

Attribute	Description
cudaMemPoolAttrReleaseThreshold	Read/write. The number of reserved bytes (1 MB in the code above) the pool holds onto before it tries to release unused memory back to the system at the next synchronization.
cudaMemPoolAttrReservedMemCurrent	Read-only. The number of bytes the pool currently has reserved. It cannot be set; use it for querying.
cudaMemPoolAttrReservedMemHigh	The high watermark of reserved memory. It can be read at any time, and writing 0 resets it. Useful for monitoring purposes.

Explanation

The high watermark is the largest amount of memory the pool has had reserved at any point since the watermark was last reset.

It is a statistic, not a limit: it does not cap the pool, and the pool can grow past any earlier peak.

Resetting it to 0 before a phase of the program lets you measure the peak reservation of that phase alone.

This can help developers understand memory usage patterns and detect if memory consumption approaches or exceeds expected values.

Use the memory pool

cudaMallocFromPoolAsync() allocates memory from the specified memory pool instead of the device's current pool, which is the default pool unless it has been changed with cudaDeviceSetMemPool().

cudaStream_t stream;
cudaStreamCreate(&stream);

void* ptr;
cudaMallocFromPoolAsync(&ptr, size, myPool, stream);
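
Putting the creation and usage steps together, a complete lifecycle might look like the following sketch. The function name useCustomPool is hypothetical, and the buffer is only zeroed with cudaMemsetAsync to stand in for real work.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: creates a pool on `device`, allocates `bytes` from it,
// touches the memory, frees it, and destroys the pool. Returns false on error.
bool useCustomPool(int device, size_t bytes) {
    if (cudaSetDevice(device) != cudaSuccess) return false;

    cudaMemPoolProps props = {};
    props.allocType = cudaMemAllocationTypePinned;  // non-migratable memory
    props.handleTypes = cudaMemHandleTypeNone;      // no inter-process sharing
    props.location.type = cudaMemLocationTypeDevice;
    props.location.id = device;

    cudaMemPool_t pool;
    if (cudaMemPoolCreate(&pool, &props) != cudaSuccess) return false;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaMemPoolDestroy(pool);
        return false;
    }

    void* ptr = nullptr;
    bool ok = cudaMallocFromPoolAsync(&ptr, bytes, pool, stream) == cudaSuccess;
    if (ok) {
        cudaMemsetAsync(ptr, 0, bytes, stream);  // placeholder for real work
        cudaFreeAsync(ptr, stream);              // returns the memory to `pool`
        ok = cudaStreamSynchronize(stream) == cudaSuccess;
    }

    cudaStreamDestroy(stream);
    cudaMemPoolDestroy(pool);  // custom pools must be destroyed explicitly
    return ok;
}
```

Memory allocated from a custom pool is still freed with cudaFreeAsync; the runtime knows which pool a pointer came from. The pool itself is released with cudaMemPoolDestroy once it is no longer needed.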

Pool trimming

Pool trimming in CUDA memory management refers to the process where the memory pool releases unused memory back to the operating system or underlying system allocator.

Over time, the pool can accumulate unused memory chunks that are no longer needed by the application.
Pool trimming is the act of freeing these unused memory chunks, reducing the memory footprint of the pool.
This helps in controlling memory usage and preventing the application from holding excessive unused memory.

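size_t minBytesToKeep = 1024 * 1024; // 1 MB
cudaMemPoolTrimTo(myPool, minBytesToKeep); // release unused memory until at most 1 MB stays reserved

Explanation

The second argument of cudaMemPoolTrimTo is the number of bytes the pool should keep reserved, not the amount to release. The pool releases unused memory until its reserved size drops to that value. Memory that backs live allocations is never released, so the pool can stay larger than requested. A smaller value trims more aggressively.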
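
A minimal sketch of trimming in action is shown below. The struct TrimResult and the function trimToZero are hypothetical names for this illustration. The release threshold is first raised so that the freed buffer stays cached in the pool, and then cudaMemPoolTrimTo is called with 0 to ask the pool to release as much unused memory as it can.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Reserved bytes measured before and after the trim.
struct TrimResult {
    uint64_t before;
    uint64_t after;
};

// Hypothetical helper: caches `bytes` of freed memory in `pool`, then trims.
TrimResult trimToZero(cudaMemPool_t pool, size_t bytes) {
    TrimResult r = {0, 0};

    // Keep freed memory in the pool so there is something to trim.
    uint64_t keepAll = UINT64_MAX;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &keepAll);

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return r;

    void* p = nullptr;
    if (cudaMallocFromPoolAsync(&p, bytes, pool, stream) == cudaSuccess) {
        cudaFreeAsync(p, stream);
    }
    cudaStreamSynchronize(stream);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &r.before);

    // Second argument: the number of bytes the pool should keep reserved.
    cudaMemPoolTrimTo(pool, 0);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &r.after);

    cudaStreamDestroy(stream);
    return r;
}
```

Comparing the two fields shows how much memory the trim handed back. If other allocations from the same pool are still live, the value after the trim stays above zero.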

Feature	Default Memory Pool	Custom Memory Pool
Global shared pool	Yes	No (per-instance)
Automatically initialized	Yes	No
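Can be configured	Attributes only (e.g. release threshold)	Creation properties and attributes
Used by `cudaMallocAsync`	Yes, by default	Only if made current with `cudaDeviceSetMemPool`; otherwise use `cudaMallocFromPoolAsync`
Lifetime	Tied to context/device	You manage it (`cudaMemPoolDestroy`)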

Key Points

Memory pools in CUDA allow for efficient memory management by reducing allocation overhead.

The default memory pool grows dynamically and reuses memory for faster allocations.

Custom memory pools provide more control over allocation limits, reuse policies, and release thresholds.

Use cudaMallocFromPoolAsync() to allocate from a custom memory pool.

Pool trimming helps manage memory usage by releasing unused chunks back to the system.