Memory Management

Chapter 3 — Global, shared, constant, and local memory

The GPU Memory Hierarchy

Understanding memory is the single most important factor for writing fast CUDA code. The GPU has multiple memory types, each with different size, speed, and scope:

  • Registers — fastest, private to each thread (a few KB)
  • Local memory — per-thread, backed by DRAM (slow), used when registers spill
  • Shared memory — on-chip scratchpad shared by all threads in a block (~48–164 KB, very fast)
  • Global memory — large DRAM accessible by all threads (GBs, slowest)
  • Constant memory — read-only, cached, 64 KB
  • Texture memory — read-only with a specialized cache for spatial locality
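The exact limits vary by GPU generation. To see them on your own device, you can query them at runtime; here is a minimal sketch using `cudaGetDeviceProperties` (the field names are from the CUDA runtime API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Registers per block:  %d\n",        prop.regsPerBlock);
    printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Constant memory:      %zu bytes\n", prop.totalConstMem);
    printf("Global memory:        %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```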
Speed comparison

Registers: ~1 cycle | Shared memory: ~5 cycles | Global memory: ~200–800 cycles. That's a gap of two to three orders of magnitude between registers and DRAM, which is why optimizing memory access is critical.

Global Memory

Global memory is the device's main DRAM store and the destination for host-to-device transfers. You allocate it with cudaMalloc and copy data with cudaMemcpy:

float *d_data;
size_t bytes = N * sizeof(float);

// Allocate on GPU
cudaMalloc(&d_data, bytes);

// Host → Device
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

// Device → Host
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

// Free GPU memory
cudaFree(d_data);
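In real programs, check the return value of every CUDA call; cudaMalloc in particular can fail when the device runs out of memory. A common pattern is a small checking macro (the name CUDA_CHECK here is our own convention, not part of the API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro — wrap every runtime call with it.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_data, bytes));
//   CUDA_CHECK(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice));
```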

Shared Memory

Shared memory is an on-chip scratchpad that threads within the same block can use to communicate and collaborate. It's declared with __shared__:

shared_example.cu
__global__ void sharedMemExample(float *input, float *output, int n) {
    __shared__ float cache[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from global memory into shared memory
    if (gid < n) {
        cache[tid] = input[gid];
    }

    // Synchronize — ensure all threads have loaded their data
    __syncthreads();

    // Now every thread in the block can read any cache[] element.
    // Example: reverse order within the block
    // (assumes n is a multiple of blockDim.x; in a partial last block,
    //  some cache[] entries read here were never written)
    if (gid < n) {
        output[gid] = cache[blockDim.x - 1 - tid];
    }
}
Always sync before reading shared memory

If you read shared memory written by other threads without calling __syncthreads() first, you have a data race: the values you see are undefined and may be stale garbage.
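A classic use of shared memory is a block-level reduction: each thread loads one element, then the block cooperatively sums them, synchronizing between steps. A sketch, assuming a power-of-two block size of 256:

```cuda
__global__ void blockSum(const float *input, float *blockSums, int n) {
    __shared__ float cache[256];          // must match blockDim.x

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element (0 for out-of-range threads)
    cache[tid] = (gid < n) ? input[gid] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            cache[tid] += cache[tid + stride];
        }
        __syncthreads();  // sync before the next step reads cache[]
    }

    // Thread 0 writes this block's partial sum
    if (tid == 0) {
        blockSums[blockIdx.x] = cache[0];
    }
}
```

The host then sums the per-block partials (or launches the kernel again on them) to get the final result.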

Constant Memory

Use constant memory for small, read-only data that every thread reads (e.g., lookup tables, filter coefficients). It's cached and broadcast-optimized:

__constant__ float coefficients[64];

// On the host:
float h_coeffs[64] = { /* ... */ };
cudaMemcpyToSymbol(coefficients, h_coeffs, 64 * sizeof(float));

// In the kernel:
__global__ void applyFilter(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Note: the constant cache is fastest when every thread in a warp
        // reads the same address; divergent indices like this are serialized.
        data[i] *= coefficients[i % 64];
    }
}

Unified Memory (Managed Memory)

CUDA 6+ introduced Unified Memory, which simplifies programming by creating a single address space accessible from both CPU and GPU:

float *data;
cudaMallocManaged(&data, N * sizeof(float));

// Use on CPU
for (int i = 0; i < N; i++) data[i] = i;

// Use on GPU — no explicit copy needed!
myKernel<<<blocks, threads>>>(data, N);
cudaDeviceSynchronize();

// Read results on CPU — automatically migrated back
printf("result: %f\n", data[0]);

cudaFree(data);
When to use Unified Memory

Great for prototyping and simple programs. For maximum performance, explicit cudaMemcpy with pinned memory and streams gives you more control.
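When you use Unified Memory but know where the data is needed next, cudaMemPrefetchAsync (CUDA 8+, on devices supporting concurrent managed access) lets you migrate pages ahead of time instead of paying page-fault costs on first touch. A sketch, reusing the managed allocation from above:

```cuda
float *data;
size_t bytes = N * sizeof(float);
cudaMallocManaged(&data, bytes);

int device;
cudaGetDevice(&device);

// Migrate the data to the GPU before the kernel runs
cudaMemPrefetchAsync(data, bytes, device);
myKernel<<<blocks, threads>>>(data, N);

// Migrate the results back to the CPU before reading them there
cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
cudaDeviceSynchronize();
```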

Pinned (Page-Locked) Host Memory

Regular malloc memory can be paged out by the OS, slowing transfers. Pinned memory stays in physical RAM and enables faster DMA transfers:

float *h_pinned;
cudaMallocHost(&h_pinned, bytes);  // pinned allocation

// Transfers from pinned memory are typically 2–3× faster
cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);

cudaFreeHost(h_pinned);
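Pinned memory also unlocks cudaMemcpyAsync: on a non-default stream, transfers can overlap with kernel execution. A sketch, assuming h_pinned, d_data, bytes, and myKernel as in the examples above:

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Async copies are only truly asynchronous with pinned host memory
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<blocks, threads, 0, stream>>>(d_data, N);
cudaMemcpyAsync(h_pinned, d_data, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);  // wait for copy → kernel → copy to finish
cudaStreamDestroy(stream);
```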

Summary

  • Global memory is large but slow — minimize accesses and use coalesced patterns
  • Shared memory is fast on-chip memory for intra-block communication — always sync before reading
  • Constant memory is ideal for small, read-only, broadcast data
  • Unified Memory simplifies coding but explicit management gives more performance control
  • Pinned memory on the host accelerates host↔device transfers