Memory Management
Chapter 3 — Global, shared, constant, and local memory
The GPU Memory Hierarchy
Understanding memory is the single most important factor for writing fast CUDA code. The GPU has multiple memory types, each with different size, speed, and scope:
- Registers — fastest, private to each thread (a few KB)
- Local memory — per-thread, backed by DRAM (slow), used when registers spill
- Shared memory — on-chip scratchpad shared by all threads in a block (~48–164 KB, very fast)
- Global memory — large DRAM accessible by all threads (GBs, slowest)
- Constant memory — read-only, cached, 64 KB
- Texture memory — read-only with a specialized cache for spatial locality
Registers: ~1 cycle | Shared memory: ~5 cycles | Global memory: ~200–800 cycles. That's a difference of two to three orders of magnitude! Optimizing memory access is critical.
Global Memory
Global memory is the GPU's main off-chip DRAM and the staging ground for data moved between host and device. You allocate it with cudaMalloc and transfer with cudaMemcpy:
float *d_data;
size_t bytes = N * sizeof(float);
// Allocate on GPU
cudaMalloc(&d_data, bytes);
// Host → Device
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
// Device → Host
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
// Free GPU memory
cudaFree(d_data);
Shared Memory
Shared memory is an on-chip scratchpad that threads within the same block can use to communicate and collaborate. It's declared with __shared__:
__global__ void sharedMemExample(float *input, float *output, int n) {
__shared__ float cache[256];
int tid = threadIdx.x;
int gid = blockIdx.x * blockDim.x + threadIdx.x;
// Load from global memory into shared memory
if (gid < n) {
cache[tid] = input[gid];
}
// Synchronize — ensure all threads have loaded their data
__syncthreads();
// Now every thread in the block can read any cache[] element
// Example: reverse order within the block
// (assumes n is a multiple of blockDim.x; a partial last block
// would read uninitialized cache[] entries)
if (gid < n) {
output[gid] = cache[blockDim.x - 1 - tid];
}
}
If you read shared memory written by other threads without calling __syncthreads() first, you will get undefined (garbage) values.
Constant Memory
Use constant memory for small, read-only data that every thread reads (e.g., lookup tables, filter coefficients). It's cached and broadcast-optimized:
__constant__ float coefficients[64];
// On the host:
float h_coeffs[64] = { /* ... */ };
cudaMemcpyToSymbol(coefficients, h_coeffs, 64 * sizeof(float));
// In the kernel:
__global__ void applyFilter(float *data, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
data[i] *= coefficients[i % 64];
}
}
Unified Memory (Managed Memory)
CUDA 6+ introduced Unified Memory, which simplifies programming by creating a single address space accessible from both CPU and GPU:
float *data;
cudaMallocManaged(&data, N * sizeof(float));
// Use on CPU
for (int i = 0; i < N; i++) data[i] = i;
// Use on GPU — no explicit copy needed!
myKernel<<<blocks, threads>>>(data, N);
cudaDeviceSynchronize();
// Read results on CPU — automatically migrated back
printf("result: %f\n", data[0]);
cudaFree(data);
Great for prototyping and simple programs. For maximum performance, explicit cudaMemcpy with pinned memory and streams gives you more control.
Pinned (Page-Locked) Host Memory
Memory from an ordinary malloc is pageable, so the OS can move it and the driver must stage each transfer through an internal pinned buffer. Pinned memory stays resident in physical RAM and enables direct DMA transfers:
float *h_pinned;
// Pinned allocation
cudaMallocHost(&h_pinned, bytes);
// Transfers with pinned memory are 2-3× faster
cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);
cudaFreeHost(h_pinned);
Summary
- Global memory is large but slow — minimize accesses and use coalesced patterns
- Shared memory is fast on-chip memory for intra-block communication — always sync before reading
- Constant memory is ideal for small, read-only, broadcast data
- Unified Memory simplifies coding but explicit management gives more performance control
- Pinned memory on the host accelerates host↔device transfers