CUDA Programming Model
Chapter 2 — Threads, blocks, grids and kernel functions
The Thread Hierarchy
CUDA organizes parallel execution into a three-level hierarchy:
- Thread — the smallest unit of execution. Each thread runs the kernel function independently.
- Block — a group of threads that can cooperate via shared memory and synchronization.
- Grid — a collection of blocks that together execute the kernel.
A Grid is a city. Each Block is a building. Each Thread is a worker inside a building. Workers in the same building can talk to each other (shared memory), but workers across buildings communicate only through global memory.
Thread Indexing
Every thread has built-in variables to identify itself:
- threadIdx.x, threadIdx.y, threadIdx.z — thread index within its block
- blockIdx.x, blockIdx.y, blockIdx.z — block index within the grid
- blockDim.x, blockDim.y, blockDim.z — number of threads per block in each dimension
- gridDim.x, gridDim.y, gridDim.z — number of blocks in the grid in each dimension
The global thread index is computed as:
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
Kernel Function Qualifiers
CUDA extends C/C++ with three function qualifiers:
- __global__ — runs on the GPU, called from the CPU (or from another kernel). Must return void.
- __device__ — runs on the GPU, callable only from other GPU functions.
- __host__ — runs on the CPU. This is the default for all normal C/C++ functions.
You can combine __host__ __device__ to compile a function for both CPU and GPU.
Example: Vector Addition
Vector addition is the "Hello World" of GPU computing. Each thread adds one element:
#include <stdio.h>
__global__ void vecAdd(float *a, float *b, float *c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}
int main() {
int n = 1024;
size_t bytes = n * sizeof(float);
// Allocate host memory
float *h_a = (float*)malloc(bytes);
float *h_b = (float*)malloc(bytes);
float *h_c = (float*)malloc(bytes);
// Initialize vectors
for (int i = 0; i < n; i++) {
h_a[i] = i * 1.0f;
h_b[i] = i * 2.0f;
}
// Allocate device memory
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
// Copy data to GPU
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
// Launch kernel: 4 blocks × 256 threads = 1024 threads
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
// Copy result back to CPU
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
// Verify
printf("c[0] = %f, c[1023] = %f\n", h_c[0], h_c[1023]);
// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(h_a); free(h_b); free(h_c);
return 0;
}

Key Takeaways from this Example
- The boundary check if (i < n) is critical — we may launch more threads than data elements.
- The formula (n + threadsPerBlock - 1) / threadsPerBlock rounds up the number of blocks so every element is covered.
- Memory must be explicitly allocated on the GPU with cudaMalloc and copied with cudaMemcpy.
Choosing Block Size
Block size (threads per block) is a performance-critical parameter:
- Should be a multiple of 32 (the warp size), so no warp launches partially filled. Common choices: 128, 256, 512.
- Maximum is typically 1024 threads per block.
- Larger blocks use more shared memory and registers per block, which can limit occupancy.
Start with 256 threads per block. Profile and adjust from there.
Multi-Dimensional Grids and Blocks
For 2D problems (like image processing), you can use 2D blocks and grids:
dim3 threadsPerBlock(16, 16); // 16×16 = 256 threads
dim3 blocksPerGrid(
(width + 15) / 16,
(height + 15) / 16
);
myKernel<<<blocksPerGrid, threadsPerBlock>>>(...);
// Inside the kernel:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;

Summary
- CUDA threads are organized into Blocks inside a Grid
- Use blockIdx, threadIdx, and blockDim to compute each thread's unique work index
- __global__ functions (kernels) run on the GPU and are launched from the CPU
- Always check array bounds inside kernels to avoid out-of-range writes