Performance Optimization

Chapter 7 — Memory coalescing, occupancy, and profiling

Memory Coalescing

The single biggest performance win in CUDA is coalesced memory access. When consecutive threads access consecutive memory locations, the GPU hardware combines those accesses into a single wide transaction:

coalescing.cu
// ✅ COALESCED — thread i reads element i
__global__ void good(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float val = data[i];  // consecutive threads → consecutive addresses
    }
}

// ❌ STRIDED — thread i reads element i*stride
__global__ void bad(float *data, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) {
        float val = data[i * stride];  // scattered access = slow!
    }
}
Rule of thumb

Structure your data so that threadIdx.x maps to the innermost (fastest-varying) dimension of your arrays — this gives you coalesced access. For 2D arrays stored row-major, map threadIdx.x to the column index, so that adjacent threads in a warp read adjacent elements of the same row.
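As a sketch of that rule (the kernel name and parameters here are illustrative, not from a real API), a row-major 2D kernel coalesces when threadIdx.x indexes the column dimension:

```cuda
// Row-major 2D array: element (row, col) lives at data[row * width + col].
// threadIdx.x → col, so threads in a warp touch consecutive addresses.
__global__ void scale2D(float *data, int width, int height, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        data[row * width + col] *= s;  // coalesced within each warp
    }
}
```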

Struct of Arrays vs Array of Structs

For coalesced access, prefer Struct of Arrays (SoA) over Array of Structs (AoS):

cuda
// ❌ AoS — interleaved, poor coalescing
struct Particle_AoS {
    float x, y, z;
    float vx, vy, vz;
};
Particle_AoS particles[N];
// thread i reads particles[i].x → stride of 6 floats

// ✅ SoA — contiguous arrays, perfect coalescing
struct Particles_SoA {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
};
Particles_SoA particles;
// thread i reads particles.x[i] → stride of 1 float
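To see the payoff in a kernel, here is a minimal sketch (the Euler-step update and dt are illustrative) of a position update over the SoA layout above — every load and store is unit-stride across the warp:

```cuda
// Euler position update on the SoA layout: all six array accesses
// have stride 1 across consecutive threads, so each one coalesces.
__global__ void integrate(Particles_SoA *p, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        p->x[i] += p->vx[i] * dt;
        p->y[i] += p->vy[i] * dt;
        p->z[i] += p->vz[i] * dt;
    }
}
```

The same update on the AoS layout would load particles[i].x with a stride of six floats, wasting most of each memory transaction.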

Occupancy

Occupancy is the ratio of active warps to the maximum number of warps a Streaming Multiprocessor (SM) can host. Higher occupancy generally means better latency hiding.

Factors that limit occupancy:

  • Registers per thread — more registers per thread → fewer threads fit on an SM
  • Shared memory per block — more shared memory per block → fewer blocks per SM
  • Block size — if block size doesn't divide evenly into SM warp slots, resources are wasted
cuda
// Query optimal block size for a kernel
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(
    &minGridSize, &blockSize, myKernel,
    /*dynamicSMemSize=*/0, /*blockSizeLimit=*/0);  // 0 = no block size cap

printf("Optimal block size: %d\n", blockSize);
printf("Min grid size: %d\n", minGridSize);
Occupancy isn't everything

Higher occupancy doesn't always mean faster kernels. Sometimes reducing occupancy to use more registers or shared memory per thread yields better performance. Always profile!
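One way to put a number on that trade-off (the kernel name is illustrative) is to ask the runtime how many blocks of a given size actually fit on an SM, then convert that to a theoretical occupancy percentage:

```cuda
// Report theoretical occupancy for a chosen block size.
int numBlocks;       // active blocks per SM for this kernel/block size
int blockSize = 256;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, myKernel, blockSize, /*dynamicSMemSize=*/0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (float)(numBlocks * blockSize)
                / prop.maxThreadsPerMultiProcessor;
printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
```

Comparing this number before and after a change (e.g. adding registers per thread) shows whether the change actually cost you occupancy — and profiling shows whether that mattered.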

Loop Unrolling

Unrolling loops reduces branch overhead and enables instruction-level parallelism:

cuda
// Manual unrolling
__global__ void reduceUnrolled(float *data, float *result, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // Load 2 elements per thread (reduces blocks by half)
    sdata[tid] = (i < n ? data[i] : 0) +
                 (i + blockDim.x < n ? data[i + blockDim.x] : 0);
    __syncthreads();

    // Unrolled reduction
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Warp-level reduction. Note: this volatile idiom relies on implicit
    // warp-synchronous execution; on Volta and newer GPUs, insert
    // __syncwarp() between steps (or use __shfl_down_sync) to stay correct.
    if (tid < 32) {
        volatile float *vmem = sdata;
        vmem[tid] += vmem[tid + 32];
        vmem[tid] += vmem[tid + 16];
        vmem[tid] += vmem[tid + 8];
        vmem[tid] += vmem[tid + 4];
        vmem[tid] += vmem[tid + 2];
        vmem[tid] += vmem[tid + 1];
    }

    if (tid == 0) result[blockIdx.x] = sdata[0];
}
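Besides manual unrolling, the compiler can unroll loops whose trip count is known at compile time via #pragma unroll. A minimal sketch (kernel name and the 4-elements-per-thread choice are illustrative):

```cuda
// Compiler-directed unrolling: with a compile-time trip count,
// #pragma unroll replaces the loop with straight-line code,
// removing branch overhead and exposing instruction-level parallelism.
__global__ void axpyUnrolled(float *y, const float *x, float a, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    #pragma unroll
    for (int k = 0; k < 4; ++k) {   // 4 elements per thread
        if (base + k < n)
            y[base + k] += a * x[base + k];
    }
}
```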

Profiling Tools

Never optimize blindly — always profile first. NVIDIA provides several tools:

  • Nsight Compute — detailed kernel-level analysis (memory throughput, occupancy, stalls)
  • Nsight Systems — system-wide timeline view (CPU/GPU interaction, stream overlap)
  • nvprof (legacy) — command-line profiler, still useful for quick checks
terminal
# Quick profiling with Nsight Systems
nsys profile ./my_cuda_app

# Detailed kernel analysis with Nsight Compute
ncu --set full ./my_cuda_app

# Legacy profiler
nvprof ./my_cuda_app

Optimization Checklist

  1. Coalesce global memory accesses — use SoA layout, align data
  2. Use shared memory — cache frequently reused data
  3. Minimize host↔device transfers — batch operations, use pinned memory
  4. Overlap compute & transfer — use multiple CUDA streams
  5. Choose the right block size — use cudaOccupancyMaxPotentialBlockSize
  6. Reduce thread divergence — avoid conditionals that split a warp
  7. Unroll critical loops — fewer instructions, better ILP
  8. Profile, don't guess — use Nsight Compute / Nsight Systems
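Items 3 and 4 of the checklist work together; as a hedged sketch (buffer names, sizes, and the two-chunk split are illustrative), pinned host memory makes cudaMemcpyAsync truly asynchronous, and per-chunk streams let transfers overlap kernel execution:

```cuda
// Overlap H2D copy, kernel, and D2H copy across two streams.
float *h_buf;                        // pinned host memory → async-capable
cudaMallocHost(&h_buf, N * sizeof(float));
float *d_buf;
cudaMalloc(&d_buf, N * sizeof(float));

cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

int chunk = N / 2;                   // assumes N is even for simplicity
for (int s = 0; s < 2; ++s) {
    int off = s * chunk;
    cudaMemcpyAsync(d_buf + off, h_buf + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    myKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_buf + off, chunk);
    cudaMemcpyAsync(h_buf + off, d_buf + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
for (int s = 0; s < 2; ++s) cudaStreamSynchronize(streams[s]);
```

While stream 0's kernel runs, stream 1's host-to-device copy can proceed on the copy engine, hiding transfer time behind compute.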

Summary

  • Coalesced memory access is the #1 optimization — consecutive threads should access consecutive memory
  • SoA layout outperforms AoS on GPUs for this reason
  • Occupancy affects latency hiding — balance registers, shared memory, and block size
  • Always profile with Nsight tools before and after optimizing