Synchronization
Chapter 5 — Thread synchronization, barriers, and atomic operations
Why Synchronization Matters
When thousands of threads run in parallel, you often need to coordinate them. Without proper synchronization, threads can read stale data or produce non-deterministic results (race conditions).
Block-Level Synchronization: __syncthreads()
__syncthreads() is a barrier — all threads in a block must reach it before any thread can proceed past it. Use it whenever threads in the same block need to share data via shared memory.
__global__ void blockSum(float *input, float *output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load into shared memory
    sdata[tid] = (gid < n) ? input[gid] : 0.0f;
    __syncthreads();

    // Parallel reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads(); // sync after each step!
    }

    // Thread 0 writes the block's sum
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

Never place __syncthreads() inside a conditional branch where some threads in the block won't reach it. All threads in the block must execute the same __syncthreads() call, or the program will deadlock.
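To make the failure mode concrete, here is a minimal sketch (the kernel names and the tid threshold of 128 are illustrative assumptions, not from the text) contrasting a divergent barrier with the safe pattern of guarding only the work, not the barrier:

```cuda
// BROKEN (sketch): threads with tid >= 128 return early and never
// reach the barrier. The behavior is undefined and typically hangs.
__global__ void badKernel(float *data) {
    int tid = threadIdx.x;
    if (tid >= 128) return;   // some threads leave the block early
    data[tid] *= 2.0f;
    __syncthreads();          // not all threads arrive — deadlock risk
}

// SAFE: every thread reaches the same barrier; only the work
// itself sits inside the conditional.
__global__ void goodKernel(float *data) {
    int tid = threadIdx.x;
    if (tid < 128) {
        data[tid] *= 2.0f;
    }
    __syncthreads();          // all threads in the block arrive
}
```

The rule of thumb: conditions may guard computation, but the barrier itself must be executed unconditionally by every thread in the block.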
Warp-Level Execution
A warp is a group of 32 threads that execute instructions together (SIMT — Single Instruction, Multiple Threads). On pre-Volta GPUs, threads in a warp run in strict lockstep. Since Volta (compute capability 7.0), independent thread scheduling allows threads within a warp to diverge, which is why the modern primitives take an explicit participation mask and carry a _sync suffix.
CUDA 9+ provides warp-level primitives:
// Warp shuffle — exchange data between threads in a warp,
// no shared memory needed!
float val = /* thread's value */;

// Get the value held by the lane one below (laneId - 1)
float neighbor = __shfl_up_sync(0xFFFFFFFF, val, 1);

// Broadcast lane 0's value to all threads in the warp
float broadcast = __shfl_sync(0xFFFFFFFF, val, 0);

// Sum reduction within a warp (no shared memory!)
for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
}
// After this loop, lane 0 holds the warp's sum

The first argument to __shfl_*_sync is a 32-bit mask indicating which threads participate. 0xFFFFFFFF means all 32 threads in the warp.
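The shuffle loop above is usually wrapped in a helper and combined with a small shared-memory staging array to reduce an entire block. The following is a sketch under stated assumptions: the names warpReduceSum and blockReduceSum are illustrative, and a 256-thread block (8 warps) is assumed:

```cuda
__inline__ __device__ float warpReduceSum(float val) {
    // Shuffle-down reduction within one warp; lane 0 ends with the sum
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    return val;
}

__global__ void blockReduceSum(float *input, float *output, int n) {
    __shared__ float warpSums[8];        // 256 threads / 32 = 8 warps
    int tid  = threadIdx.x;
    int gid  = blockIdx.x * blockDim.x + tid;
    int lane = tid % 32;
    int warp = tid / 32;

    float val = (gid < n) ? input[gid] : 0.0f;
    val = warpReduceSum(val);            // step 1: reduce each warp

    if (lane == 0) warpSums[warp] = val; // step 2: one partial per warp
    __syncthreads();

    // Step 3: the first warp reduces the 8 partial sums
    if (warp == 0) {
        val = (lane < 8) ? warpSums[lane] : 0.0f;
        val = warpReduceSum(val);
        if (tid == 0) output[blockIdx.x] = val;
    }
}
```

Compared with the purely shared-memory reduction shown earlier, this version needs only one __syncthreads() and eight floats of shared memory per block.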
Atomic Operations
When multiple threads need to update the same memory location, use atomic operations to prevent race conditions:
__global__ void histogram(int *data, int *bins, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        int bin = data[gid];
        atomicAdd(&bins[bin], 1); // safe concurrent increment
    }
}
// Available atomic operations:
// atomicAdd, atomicSub, atomicMin, atomicMax
// atomicAnd, atomicOr, atomicXor
// atomicExch — swap
// atomicCAS — compare-and-swap (most flexible)

Atomic operations serialize access, creating contention. To build a fast histogram, first accumulate in shared memory per block, then use one atomicAdd per bin per block to merge the partial histogram into global memory.
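The privatized-histogram idea above can be sketched as follows (NUM_BINS, the kernel name, and the assumption that input values already fall in [0, NUM_BINS) are illustrative, not from the text):

```cuda
#define NUM_BINS 256

__global__ void histogramPrivatized(int *data, int *bins, int n) {
    __shared__ int localBins[NUM_BINS];

    // Zero the block-private histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        localBins[i] = 0;
    __syncthreads();

    // Accumulate into shared memory — contention stays within the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        atomicAdd(&localBins[data[gid]], 1);
    __syncthreads();

    // Merge: at most one global atomicAdd per bin per block
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        if (localBins[i] > 0)
            atomicAdd(&bins[i], localBins[i]);
}
```

Shared-memory atomics are much cheaper than global ones, and the final merge step issues far fewer global atomics than the naive one-per-element version.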
Cooperative Groups (Modern CUDA)
CUDA 9+ introduced Cooperative Groups — a flexible API for synchronization at various granularities:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel() {
    // Get the thread block group
    cg::thread_block block = cg::this_thread_block();
    block.sync(); // equivalent to __syncthreads()

    // Get the group of currently active (converged) threads
    cg::coalesced_group active = cg::coalesced_threads();

    // Get a tile of 16 threads within a warp
    cg::thread_block_tile<16> tile =
        cg::tiled_partition<16>(block);
    tile.sync();
}

Summary
- __syncthreads() synchronizes all threads within a block — required when sharing data via shared memory
- Warps (32 threads) execute together in SIMT fashion; warp shuffle functions exchange data without shared memory
- Atomic operations (atomicAdd, atomicCAS, etc.) prevent race conditions on shared data
- Cooperative Groups provide modern, flexible synchronization primitives