Introduction to CUDA
Chapter 1 — Understanding GPU computing and why CUDA exists
What is GPU Computing?
A GPU (Graphics Processing Unit) was originally designed to render pixels on screen. Over time, engineers discovered that the massively parallel architecture of GPUs could accelerate many workloads beyond graphics — scientific simulations, machine learning, image processing, and more.
While a typical CPU has 8 to 64 powerful cores optimized for sequential tasks, a modern GPU has thousands of smaller cores designed to execute the same operation on many data points simultaneously. This is called data parallelism.
CPUs are like a team of expert workers — each can handle complex tasks independently. GPUs are like an army of simple workers — each does one small job, but together they finish massive workloads incredibly fast.
What is CUDA?
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. It lets you write C/C++ code that runs on NVIDIA GPUs.
With CUDA you can:
- Write functions (called kernels) that execute on the GPU
- Move data between CPU (host) memory and GPU (device) memory
- Coordinate thousands of parallel threads
Your First CUDA Program
Let's look at the simplest possible CUDA program — a "Hello from GPU" kernel. Don't worry about understanding every line yet; we'll cover each concept in depth in the following chapters.
```cuda
#include <stdio.h>

// A kernel function — runs on the GPU
__global__ void helloKernel() {
    printf("Hello from GPU thread %d!\n", threadIdx.x);
}

int main() {
    // Launch the kernel with 1 block of 8 threads
    helloKernel<<<1, 8>>>();

    // Wait for the GPU to finish
    cudaDeviceSynchronize();

    printf("Done!\n");
    return 0;
}
```

Breaking it down
- __global__ — tells the compiler this function runs on the GPU and is called from the CPU.
- <<<1, 8>>> — the execution configuration: launch 1 block containing 8 threads.
- threadIdx.x — a built-in variable that gives each thread its unique ID within its block.
- cudaDeviceSynchronize() — waits for all GPU threads to finish before the CPU continues.
CPU vs GPU: When to Use CUDA
CUDA shines when your problem can be broken into many independent, similar tasks:
- Matrix math — linear algebra, neural network layers
- Image processing — filters, transforms applied to every pixel
- Simulations — physics, molecular dynamics, Monte Carlo
- Data processing — parallel reductions, sorting, searching
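All four workloads share the same shape: one independent operation per data element. The canonical first example is vector addition. Here is a sketch, assuming a CUDA-capable GPU; the array size, launch configuration, and names like `vecAdd` are illustrative, and error checking is omitted for brevity:

```cuda
#include <stdio.h>

// Each thread computes exactly one element of c. No thread depends
// on any other — that independence is what makes this data-parallel.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float hostA[1024], hostB[1024], hostC[1024];
    for (int i = 0; i < n; i++) { hostA[i] = i; hostB[i] = 2.0f * i; }

    // Allocate device memory and copy inputs host -> device
    float *devA, *devB, *devC;
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);
    cudaMalloc(&devC, bytes);
    cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, hostB, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    vecAdd<<<(n + 255) / 256, 256>>>(devA, devB, devC, n);

    // Copy the result back and release device memory
    cudaMemcpy(hostC, devC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devA); cudaFree(devB); cudaFree(devC);

    printf("hostC[10] = %.1f\n", hostC[10]);  // 10 + 20 = 30.0
    return 0;
}
```

Notice the explicit cudaMemcpy calls: every byte the GPU touches must be copied to device memory and back, which is exactly the overhead that makes CUDA a poor fit for small or sequential workloads.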
If your workload is inherently sequential (e.g., parsing a complex recursive data structure) or involves very little data, the overhead of moving data to the GPU may outweigh the performance gain.
Setting Up CUDA
To write and compile CUDA programs locally, you need:
- An NVIDIA GPU (GeForce, Quadro, Tesla, or a datacenter GPU)
- The NVIDIA CUDA Toolkit — includes the nvcc compiler, runtime libraries, and tools
- A C/C++ compiler (gcc, MSVC, or clang)
Download the CUDA Toolkit from the official NVIDIA Developer site. Or — once our browser playground is live — you'll be able to write and run CUDA code right here!
```shell
# Compile a .cu file with nvcc
nvcc hello.cu -o hello

# Run the program
./hello

# Output:
# Hello from GPU thread 0!
# Hello from GPU thread 1!
# ...
# Hello from GPU thread 7!
# Done!
```
Summary
- GPUs have thousands of cores designed for parallel execution
- CUDA lets you write C/C++ code that runs on NVIDIA GPUs
- Kernel functions use __global__ and are launched with <<<blocks, threads>>>
- Use CUDA when your problem is data-parallel and compute-heavy