CUDA Programming Model
Chapter 2 — Threads, blocks, grids and kernel functions
The Thread Hierarchy
CUDA organizes parallel execution into a three-level hierarchy:
- Thread — the smallest unit of execution. Each thread runs the kernel function independently.
- Block — a group of threads that can cooperate via shared memory and synchronization.
- Grid — a collection of blocks that together execute the kernel.
A Grid is a city. Each Block is a building. Each Thread is a worker inside a building. Workers in the same building can talk to each other (shared memory), but workers across buildings communicate only through global memory.
Thread Indexing
Every thread has built-in variables to identify itself:
- threadIdx.x, threadIdx.y, threadIdx.z — thread index within its block
- blockIdx.x, blockIdx.y, blockIdx.z — block index within the grid
- blockDim.x, blockDim.y, blockDim.z — number of threads per block in each dimension
- gridDim.x, gridDim.y, gridDim.z — number of blocks in the grid in each dimension
The global thread index is computed as:
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
Kernel Function Qualifiers
CUDA extends C/C++ with three function qualifiers:
- __global__ — runs on the GPU, called from the CPU (or from another kernel). Must return void.
- __device__ — runs on the GPU, callable only from other GPU functions.
- __host__ — runs on the CPU. This is the default for all normal C/C++ functions.
You can combine __host__ __device__ to compile a function for both CPU and GPU.
Example: Vector Addition
Vector addition is the "Hello World" of GPU computing. Each thread adds one element:
#include <stdio.h>
__global__ void vecAdd(float *a, float *b, float *c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}
int main() {
int n = 1024;
size_t bytes = n * sizeof(float);
// Allocate host memory
float *h_a = (float*)malloc(bytes);
float *h_b = (float*)malloc(bytes);
float *h_c = (float*)malloc(bytes);
// Initialize vectors
for (int i = 0; i < n; i++) {
h_a[i] = i * 1.0f;
h_b[i] = i * 2.0f;
}
// Allocate device memory
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
// Copy data to GPU
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
// Launch kernel: 4 blocks × 256 threads = 1024 threads
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
// Copy result back to CPU
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
// Verify
printf("c[0] = %f, c[1023] = %f\n", h_c[0], h_c[1023]);
// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(h_a); free(h_b); free(h_c);
return 0;
}

Key Takeaways from this Example
- The boundary check if (i < n) is critical — we may launch more threads than data elements.
- The formula (n + threadsPerBlock - 1) / threadsPerBlock rounds up the number of blocks so every element is covered.
- Memory must be explicitly allocated on the GPU with cudaMalloc and copied with cudaMemcpy.
Choosing Block Size
Block size (threads per block) is a performance-critical parameter:
- Should be a multiple of 32 (the warp size), so no warp launches partially filled. Common choices: 128, 256, 512.
- Maximum is typically 1024 threads per block.
- Larger blocks use more shared memory and registers per block, which can limit occupancy.
Start with 256 threads per block. Profile and adjust from there.
Multi-Dimensional Grids and Blocks
For 2D problems (like image processing), you can use 2D blocks and grids:
dim3 threadsPerBlock(16, 16); // 16×16 = 256 threads
dim3 blocksPerGrid(
(width + 15) / 16,
(height + 15) / 16
);
myKernel<<<blocksPerGrid, threadsPerBlock>>>(...);
// Inside the kernel:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;

Summary
- CUDA threads are organized into Blocks inside a Grid
- Use blockIdx, threadIdx, and blockDim to compute each thread's unique work index
- __global__ functions (kernels) run on the GPU and are launched from the CPU
- Always check array bounds inside kernels to avoid out-of-range writes