Introduction to CUDA

Chapter 1 — Understand GPU computing and why CUDA exists

What is GPU Computing?

A GPU (Graphics Processing Unit) was originally designed to render pixels on screen. Over time, engineers discovered that the massively parallel architecture of GPUs could accelerate many workloads beyond graphics — scientific simulations, machine learning, image processing, and more.

While a typical CPU has 8 to 64 powerful cores optimized for sequential tasks, a modern GPU has thousands of smaller cores designed to execute the same operation on many data points simultaneously. This is called data parallelism.
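To make "the same operation on many data points" concrete, here is a sketch comparing a sequential CPU loop with its data-parallel CUDA counterpart. (This is not yet a complete program — the launch syntax and the `blockIdx`/`blockDim` built-ins used to compute each thread's index are covered in the chapters ahead.)

```cuda
// Sequential CPU version: one core walks the array element by element
void addArraysCPU(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Data-parallel GPU version: each thread handles exactly one element
__global__ void addArraysGPU(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) {                                    // guard against extra threads
        c[i] = a[i] + b[i];
    }
}
```

The loop disappears: instead of one core iterating n times, n threads each do one iteration's worth of work at the same time.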

Key Insight

CPUs are like a team of expert workers — each can handle complex tasks independently. GPUs are like an army of simple workers — each does one small job, but together they finish massive workloads incredibly fast.

What is CUDA?

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. It lets you write C/C++ code that runs on NVIDIA GPUs.

With CUDA you can:

  • Write functions (called kernels) that execute on the GPU
  • Move data between CPU (host) memory and GPU (device) memory
  • Coordinate thousands of parallel threads
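The three bullets above come together in even the smallest real CUDA program. The sketch below (hypothetical example; the kernel and array names are ours) moves data to the device, runs a kernel on it, and copies the result back — on a system with a working CUDA GPU it prints the doubled values:

```cuda
#include <stdio.h>

// Kernel: each thread doubles one element
__global__ void doubleElements(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 4;
    float host[n] = {1.0f, 2.0f, 3.0f, 4.0f};

    // Allocate device (GPU) memory and copy the host data over
    float* device;
    cudaMalloc(&device, n * sizeof(float));
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch one block of n threads, then copy the result back
    doubleElements<<<1, n>>>(device, n);
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device);

    for (int i = 0; i < n; i++) printf("%.1f ", host[i]);  // 2.0 4.0 6.0 8.0
    printf("\n");
    return 0;
}
```

Note the round trip: the GPU cannot see `host` directly, so every byte the kernel touches must first travel to device memory and the results must travel back.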

Your First CUDA Program

Let's look at the simplest possible CUDA program — a "Hello from GPU" kernel. Don't worry about understanding every line yet; we'll cover each concept in depth in the following chapters.

hello.cu
#include <stdio.h>

// A kernel function — runs on the GPU
__global__ void helloKernel() {
    printf("Hello from GPU thread %d!\n", threadIdx.x);
}

int main() {
    // Launch the kernel with 1 block of 8 threads
    helloKernel<<<1, 8>>>();

    // Wait for GPU to finish
    cudaDeviceSynchronize();

    printf("Done!\n");
    return 0;
}

Breaking it down

  • __global__ — tells the compiler this function runs on the GPU and is called from the CPU.
  • <<<1, 8>>> — the execution configuration: launch 1 block containing 8 threads.
  • threadIdx.x — a built-in variable that gives each thread its unique ID within its block.
  • cudaDeviceSynchronize() — waits for all GPU threads to finish before the CPU continues.
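Building on the breakdown above, a small variation of the sample shows what happens with more than one block. This sketch additionally uses the built-ins `blockIdx.x` (the block's index) and `blockDim.x` (threads per block), which later chapters treat in depth, to give each thread an ID that is unique across all blocks:

```cuda
#include <stdio.h>

__global__ void helloGlobal() {
    // threadIdx.x repeats in every block; combining it with the
    // block index yields a globally unique ID
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from block %d, thread %d (global ID %d)!\n",
           blockIdx.x, threadIdx.x, globalId);
}

int main() {
    helloGlobal<<<2, 4>>>();   // 2 blocks x 4 threads = 8 threads total
    cudaDeviceSynchronize();
    return 0;
}
```

This `blockIdx.x * blockDim.x + threadIdx.x` pattern appears in nearly every CUDA kernel, because it is how a thread finds which piece of the data belongs to it.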

CPU vs GPU: When to Use CUDA

CUDA shines when your problem can be broken into many independent, similar tasks:

  • Matrix math — linear algebra, neural network layers
  • Image processing — filters, transforms applied to every pixel
  • Simulations — physics, molecular dynamics, Monte Carlo
  • Data processing — parallel reductions, sorting, searching
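As one concrete instance of the list above, an image filter maps naturally onto CUDA: every pixel gets the same independent operation. This hypothetical brightness kernel (launch and memory transfer omitted; the name and parameters are ours) assigns one thread per pixel:

```cuda
// Hypothetical brightness filter: one thread per pixel, each adding the
// same offset to its own pixel -- identical, independent work
__global__ void brighten(unsigned char* pixels, int numPixels, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {
        int v = pixels[i] + offset;
        pixels[i] = v > 255 ? 255 : v;   // clamp to the valid byte range
    }
}
```

A 12-megapixel photo means 12 million independent additions — exactly the shape of problem a GPU devours.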

When NOT to use CUDA

If your workload is inherently sequential (e.g., parsing a complex recursive data structure) or involves very little data, the overhead of moving data to the GPU may outweigh the performance gain.

Setting Up CUDA

To write and compile CUDA programs locally, you need:

  1. An NVIDIA GPU (GeForce, Quadro, or a datacenter GPU such as the Tesla line)
  2. The NVIDIA CUDA Toolkit — includes the nvcc compiler, runtime libraries, and tools
  3. A C/C++ compiler (gcc, MSVC, or clang)

Download the CUDA Toolkit from the official NVIDIA Developer site. Or — once our browser playground is live — you'll be able to write and run CUDA code right here!

terminal
# Compile a .cu file with nvcc
nvcc hello.cu -o hello

# Run the program
./hello
# Example output (the order of the GPU thread messages is not guaranteed):
# Hello from GPU thread 0!
# Hello from GPU thread 1!
# ...
# Hello from GPU thread 7!
# Done!
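One habit worth adopting from the very first program: most CUDA API calls return a `cudaError_t`, and a failed kernel launch is silent unless you check for it. A minimal sketch of error checking, applied to the hello-world program above:

```cuda
#include <stdio.h>

__global__ void helloKernel() {
    printf("Hello from GPU thread %d!\n", threadIdx.x);
}

int main() {
    helloKernel<<<1, 8>>>();

    // A launch error (e.g. no CUDA-capable GPU present) surfaces here...
    cudaError_t err = cudaGetLastError();
    // ...and errors during kernel execution surface when we synchronize
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Done!\n");
    return 0;
}
```

We kept the earlier examples unchecked for readability, but in real code this pattern (or a macro wrapping it) belongs after every launch and API call.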

Summary

  • GPUs have thousands of cores designed for parallel execution
  • CUDA lets you write C/C++ code that runs on NVIDIA GPUs
  • Kernel functions use __global__ and are launched with <<<blocks, threads>>>
  • Use CUDA when your problem is data-parallel and compute-heavy