CUDA Programming Model

Chapter 2 — Threads, blocks, grids and kernel functions

The Thread Hierarchy

CUDA organizes parallel execution into a three-level hierarchy:

  1. Thread — the smallest unit of execution. Each thread runs the kernel function independently.
  2. Block — a group of threads that can cooperate via shared memory and synchronization.
  3. Grid — a collection of blocks that together execute the kernel.

Think of it this way

A Grid is a city. Each Block is a building. Each Thread is a worker inside a building. Workers in the same building can talk to each other (shared memory), but workers across buildings communicate only through global memory.

Thread Indexing

Every thread has built-in variables to identify itself:

  • threadIdx.x, threadIdx.y, threadIdx.z — thread index within its block
  • blockIdx.x, blockIdx.y, blockIdx.z — block index within the grid
  • blockDim.x, blockDim.y, blockDim.z — number of threads per block in each dimension
  • gridDim.x, gridDim.y, gridDim.z — number of blocks in the grid in each dimension

The global thread index is computed as:

int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

Kernel Function Qualifiers

CUDA extends C/C++ with three function qualifiers:

  • __global__ — runs on the GPU, launched from the CPU (or, on devices that support dynamic parallelism, from another kernel). Must return void.
  • __device__ — runs on the GPU, callable only from other GPU functions.
  • __host__ — runs on the CPU. This is the default for all normal C/C++ functions.

You can combine __host__ __device__ to compile a function for both CPU and GPU.
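As a minimal sketch of how the three qualifiers combine (the function names here are illustrative, not part of the CUDA API):

```cuda
// Compiled for both CPU and GPU: callable from host code and from kernels.
__host__ __device__ float square(float x) {
    return x * x;
}

// __device__: callable only from other GPU functions.
__device__ float cube(float x) {
    return x * x * x;
}

// __global__: the kernel itself. Launched from the host, must return void.
__global__ void compute(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = square((float)i) + cube((float)i);
    }
}
```

Note that square can also be called from ordinary host code, such as main, while cube cannot.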

Example: Vector Addition

Vector addition is the "Hello World" of GPU computing. Each thread adds one element:

vec_add.cu
#include <stdio.h>

__global__ void vecAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    int n = 1024;
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    // Initialize vectors
    for (int i = 0; i < n; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data to GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: 4 blocks × 256 threads = 1024 threads
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy result back to CPU
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Verify
    printf("c[0] = %f, c[1023] = %f\n", h_c[0], h_c[1023]);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Key Takeaways from this Example

  • The boundary check if (i < n) is critical — we may launch more threads than data elements.
  • The formula (n + threadsPerBlock - 1) / threadsPerBlock rounds up the number of blocks so every element is covered.
  • Memory must be explicitly allocated on the GPU with cudaMalloc and copied with cudaMemcpy.
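One thing the example above omits for brevity is error checking: every CUDA runtime call returns a cudaError_t, and kernel launches fail silently unless you ask. A common pattern is a checking macro like the sketch below (CHECK is our own name, not part of the CUDA API):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Wrap every runtime call; print the error and abort on failure.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s:%d: %s\n",                 \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CHECK(cudaMalloc(&d_a, bytes));
//   vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
//   CHECK(cudaGetLastError());        // catches launch-configuration errors
//   CHECK(cudaDeviceSynchronize());   // catches errors raised during execution
```

Kernel launches themselves return nothing, which is why the cudaGetLastError / cudaDeviceSynchronize pair after the launch is the usual way to surface kernel errors.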

Choosing Block Size

Block size (threads per block) is a performance-critical parameter:

  • Should be a multiple of 32 (the warp size), so no warp is left partially filled. Common choices: 128, 256, 512.
  • Maximum is typically 1024 threads per block.
  • Larger blocks use more shared memory and registers per block, which can limit occupancy.

Rule of Thumb

Start with 256 threads per block. Profile and adjust from there.
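Rather than guessing, you can also ask the runtime for a suggestion. A sketch using the occupancy API, assuming the vecAdd kernel from the example above:

```cuda
int minGridSize = 0;   // minimum grid size needed for full occupancy
int blockSize   = 0;   // suggested threads per block for this kernel

// Let the runtime pick a block size that maximizes occupancy for vecAdd.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vecAdd, 0, 0);

// Then round up the grid as usual and launch.
int blocksPerGrid = (n + blockSize - 1) / blockSize;
vecAdd<<<blocksPerGrid, blockSize>>>(d_a, d_b, d_c, n);
```

The suggestion is based on the kernel's register and shared-memory usage, so it is a good starting point; profiling is still the final word.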

Multi-Dimensional Grids and Blocks

For 2D problems (like image processing), you can use 2D blocks and grids:

dim3 threadsPerBlock(16, 16);     // 16×16 = 256 threads
dim3 blocksPerGrid(
    (width + 15) / 16,
    (height + 15) / 16
);

myKernel<<<blocksPerGrid, threadsPerBlock>>>(...);

// Inside the kernel:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
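Putting the 2D indexing together, a complete kernel might look like the sketch below (brighten and the row-major image layout are our own illustrative assumptions, not from the chapter):

```cuda
// Add a constant to every pixel of a row-major grayscale image.
__global__ void brighten(float *img, int width, int height, float delta) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Both bounds checks matter: the grid is rounded up in each dimension,
    // so the last row/column of blocks may run past the image edges.
    if (col < width && row < height) {
        img[row * width + col] += delta;
    }
}
```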

Summary

  • CUDA threads are organized into Blocks inside a Grid
  • Use blockIdx, threadIdx, and blockDim to compute each thread's unique work index
  • __global__ functions (kernels) run on the GPU and are launched from the CPU
  • Always check array bounds inside kernels to avoid out-of-range writes