Kernels & Execution
Chapter 4 — Writing, launching, and debugging CUDA kernels
Anatomy of a Kernel Launch
When you call a kernel with the <<<>>> syntax, you specify the execution configuration — how many blocks and threads to use:
// Syntax:
kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args...);

// Most common usage:
kernel<<<numBlocks, threadsPerBlock>>>(args...);

// With dynamic shared memory:
kernel<<<numBlocks, threadsPerBlock, sharedSize>>>(args...);
The first two arguments are required. The third (dynamic shared memory size) and fourth (CUDA stream) are optional.
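As a sketch of all four arguments used together (the kernel name scale, the stream, and the values here are illustrative, not from this chapter), a launch with a user-created stream and dynamically sized shared memory might look like this:

```cuda
__global__ void scale(float *data, int n, float factor) {
    // Dynamically sized shared memory is declared with "extern";
    // its size comes from the third launch argument
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];          // stage through shared memory
        data[i] = tile[threadIdx.x] * factor;
    }
}

// Host side: all four configuration arguments
cudaStream_t stream;
cudaStreamCreate(&stream);
size_t sharedBytes = 256 * sizeof(float);     // one float per thread
scale<<<numBlocks, 256, sharedBytes, stream>>>(d_data, N, 2.0f);
cudaStreamDestroy(stream);
```

Launching on a non-default stream lets this kernel overlap with transfers or kernels on other streams.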
Execution Configuration Best Practices
The execution configuration directly impacts performance:
int N = 1000000;
int threadsPerBlock = 256;

// Round up to cover all elements
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;

myKernel<<<numBlocks, threadsPerBlock>>>(d_data, N);
(N + T - 1) / T is the standard idiom for computing the block count. For example, with N = 1000 and T = 256 it gives 4 blocks = 1024 threads; the 24 extra threads are idle, and the boundary check if (i < N) inside the kernel prevents them from making invalid accesses.
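An alternative to sizing the grid exactly is a grid-stride loop, in which a fixed-size grid walks over the whole array. This sketch (the kernel name is illustrative) is correct for any N regardless of the launch configuration:

```cuda
__global__ void addOne(float *data, int n) {
    // Each thread starts at its global index and advances by the
    // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        data[i] += 1.0f;
    }
}

// Launch with a modest, fixed grid; correctness no longer
// depends on numBlocks covering N exactly
addOne<<<32, 256>>>(d_data, N);
```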
Error Handling
CUDA functions return error codes, but kernel launches do not return errors immediately. You must check explicitly:
// Check CUDA API calls
cudaError_t err = cudaMalloc(&d_data, bytes);
if (err != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed: %s\n",
cudaGetErrorString(err));
exit(1);
}
// Check kernel launch errors
myKernel<<<blocks, threads>>>(args);
// Check for launch errors (wrong config, etc.)
err = cudaGetLastError();
if (err != cudaSuccess) {
fprintf(stderr, "Kernel launch failed: %s\n",
cudaGetErrorString(err));
}
// Check for execution errors (detected on sync)
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
fprintf(stderr, "Kernel execution failed: %s\n",
cudaGetErrorString(err));
}

A common pattern is a helper macro:
#define CUDA_CHECK(call) do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(1); \
} \
} while(0)
// Usage:
CUDA_CHECK(cudaMalloc(&d_data, bytes));
CUDA_CHECK(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice));

Kernel Limitations
Keep these constraints in mind when writing kernels:
- No recursion on older architectures (compute capability < 2.0); limited stack on newer ones
- No dynamic memory allocation inside kernels (except malloc on newer GPUs, with a limited heap)
- printf works from kernels but uses a limited buffer — great for debugging, not for production
- Kernel calls are asynchronous — the CPU continues immediately after launch
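To illustrate the printf point, a minimal sketch of device-side debug output (the kernel name is illustrative):

```cuda
__global__ void inspect(const float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Print from only a few threads: the device printf buffer is
    // small, and flooding it from every thread drops output
    if (i < n && i < 4) {
        printf("thread %d: data[%d] = %f\n", i, i, data[i]);
    }
}

inspect<<<numBlocks, 256>>>(d_data, N);
cudaDeviceSynchronize();   // device printf output is flushed at sync points
```

Without the synchronize, the program may exit before the buffered output ever reaches the host.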
Device Query
You can query GPU capabilities at runtime. This is useful for choosing optimal kernel parameters:
int deviceCount;
cudaGetDeviceCount(&deviceCount);
for (int i = 0; i < deviceCount; i++) {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, i);
printf("Device %d: %s\n", i, prop.name);
printf(" Compute capability: %d.%d\n",
prop.major, prop.minor);
printf(" SM count: %d\n",
prop.multiProcessorCount);
printf(" Max threads/block: %d\n",
prop.maxThreadsPerBlock);
printf(" Shared memory/block: %zu bytes\n",
prop.sharedMemPerBlock);
printf(" Global memory: %.1f GB\n",
prop.totalGlobalMem / 1e9);
}

Summary

- Use <<<gridDim, blockDim>>> to configure kernel launches
- Always round up block count with (N + T - 1) / T and use boundary checks
- Check errors with cudaGetLastError() and cudaDeviceSynchronize()
- Kernel launches are asynchronous — the CPU doesn't wait unless you synchronize
- Query device properties to write adaptive, portable code