Kernels & Execution

Chapter 4 — Writing, launching, and debugging CUDA kernels

Anatomy of a Kernel Launch

When you call a kernel with the <<<>>> syntax, you specify the execution configuration — how many blocks and threads to use:

cuda
// Syntax:
kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args...);

// Most common usage:
kernel<<<numBlocks, threadsPerBlock>>>(args...);

// With dynamic shared memory:
kernel<<<numBlocks, threadsPerBlock, sharedSize>>>(args...);

The first two arguments are required. The third (dynamic shared memory size) and fourth (CUDA stream) are optional.
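To use the fourth argument you must pass the third as well. A minimal sketch of launching on a user-created stream (the kernel name and arguments here are placeholders):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Third argument 0 = no dynamic shared memory requested.
kernel<<<numBlocks, threadsPerBlock, 0, stream>>>(args...);

cudaStreamSynchronize(stream);   // wait for work queued on this stream
cudaStreamDestroy(stream);
```

Streams are covered in detail later; for now, note that omitting the stream argument launches on the default stream.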

Execution Configuration Best Practices

The execution configuration directly impacts performance:

launch_config.cu
int N = 1000000;
int threadsPerBlock = 256;

// Round up to cover all elements
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;

myKernel<<<numBlocks, threadsPerBlock>>>(d_data, N);

The rounding formula

(N + T - 1) / T is the standard way to compute the number of blocks. For N=1000 and T=256, this gives 4 blocks, or 1024 threads total; the extra 24 threads are idle, and the boundary check if (i < N) prevents them from making invalid accesses.
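A minimal sketch of a kernel with the boundary check in place (the kernel name and its operation are illustrative, not from the text):

```cuda
// Element-wise kernel: each thread handles one element.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // guards the surplus threads in the last block
        data[i] *= 2.0f;
    }
}
```

Without the check, the surplus threads in the final block would index past the end of the array.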

Error Handling

CUDA functions return error codes, but kernel launches do not return errors immediately. You must check explicitly:

error_handling.cu
// Check CUDA API calls
cudaError_t err = cudaMalloc(&d_data, bytes);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n",
            cudaGetErrorString(err));
    exit(1);
}

// Check kernel launch errors
myKernel<<<blocks, threads>>>(args);

// Check for launch errors (wrong config, etc.)
err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel launch failed: %s\n",
            cudaGetErrorString(err));
}

// Check for execution errors (detected on sync)
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "Kernel execution failed: %s\n",
            cudaGetErrorString(err));
}

A common pattern is a helper macro:

cuda
#define CUDA_CHECK(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while(0)

// Usage:
CUDA_CHECK(cudaMalloc(&d_data, bytes));
CUDA_CHECK(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice));
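The macro covers API calls but not kernel launches, since those return nothing. One possible companion (a sketch, not a standard CUDA facility) wraps both checks from the earlier example:

```cuda
// Hypothetical helper: checks the launch and the execution of the
// kernel call that immediately precedes it. Builds on CUDA_CHECK.
#define CUDA_CHECK_KERNEL() do { \
    CUDA_CHECK(cudaGetLastError());       /* launch errors */ \
    CUDA_CHECK(cudaDeviceSynchronize());  /* execution errors */ \
} while(0)

// Usage:
myKernel<<<blocks, threads>>>(args);
CUDA_CHECK_KERNEL();
```

The synchronize call makes this convenient for debugging but costly in hot paths; in production builds it is common to compile it out.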

Kernel Limitations

Keep these constraints in mind when writing kernels:

  • No recursion on compute capability < 2.0; newer architectures support it, but the per-thread stack is limited
  • Device-side malloc works on compute capability >= 2.0, but it draws from a small preallocated heap — avoid it in performance-critical code
  • printf works from kernels but uses a limited buffer — great for debugging, not for production
  • Kernel calls are asynchronous — the CPU continues immediately after launch
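The asynchrony in the last point can be put to work: the CPU can do independent work while the kernel runs. A sketch (the kernel and host function names are placeholders):

```cuda
myKernel<<<blocks, threads>>>(d_data, N);  // returns to the CPU immediately

doIndependentHostWork();                   // overlaps with GPU execution

cudaDeviceSynchronize();                   // block until the kernel finishes
// Only now is it safe to read results produced by the kernel.
```

Note that cudaMemcpy (the synchronous variant) implicitly waits for prior work on the default stream, which is why simple programs appear to work without an explicit synchronize.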

Device Query

You can query GPU capabilities at runtime. This is useful for choosing optimal kernel parameters:

device_query.cu
int deviceCount;
cudaGetDeviceCount(&deviceCount);

for (int i = 0; i < deviceCount; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);

    printf("Device %d: %s\n", i, prop.name);
    printf("  Compute capability: %d.%d\n",
           prop.major, prop.minor);
    printf("  SM count: %d\n",
           prop.multiProcessorCount);
    printf("  Max threads/block: %d\n",
           prop.maxThreadsPerBlock);
    printf("  Shared memory/block: %zu bytes\n",
           prop.sharedMemPerBlock);
    printf("  Global memory: %.1f GB\n",
           prop.totalGlobalMem / 1e9);
}
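Beyond printing properties, the runtime can suggest a launch configuration directly via the occupancy API. A sketch, assuming a kernel myKernel(float*, int) as in the earlier examples:

```cuda
int minGridSize, blockSize;

// Ask the runtime for a block size that maximizes occupancy for this kernel
// (0, 0 = no dynamic shared memory, no block-size limit).
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                   myKernel, 0, 0);

// Combine with the usual rounding formula to cover all N elements.
int numBlocks = (N + blockSize - 1) / blockSize;
myKernel<<<numBlocks, blockSize>>>(d_data, N);
```

This adapts the configuration to whatever GPU the program runs on, rather than hard-coding a block size tuned for one device.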

Summary

  • Use <<<gridDim, blockDim>>> to configure kernel launches
  • Always round up block count with (N + T - 1) / T and use boundary checks
  • Check errors with cudaGetLastError() and cudaDeviceSynchronize()
  • Kernel launches are asynchronous — the CPU doesn't wait unless you synchronize
  • Query device properties to write adaptive, portable code